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A METHOD FOR IDENTIFYING A SUBSET OP COMPONENTS OF A SYSTEM 
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FIELD OF THE INVENTION 

The present, invention relates to a method and apparatus for 
identifying components of a system from data generated from 
samples from the system, which components are capable of 
predicting a feature of the sample within the system and, 
particularly, but not exclusively, the present invention 
relates to a method and apparatus for identifying components 
of a biological system from data generated by a biological 
method, which components are capable of predicting, a feature 
of interest associated with a sample applied to the 
biological system. 

BACKGROUND OF THE INVENTION 
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There are any number of "systems" in existence which can be 
classified into different features of interest. The term 
"system" essentially includes all types of systems for which 
data can be provided, including chemical .systems, financial 
systems (e.g. credit systems for individuals, groups or 
organisations, loan histories) , geological systems, and many 
HK^re. It is desirable to be able to utilise data generated 
from the systems (e.g. statistical data) to identify 
particular features of san5)les from the system (e.g. to 
assist with analysis of financial system to identify the 
groups which exist in the financial system (e.g. in very 
simple terms those who have "good- credit and those who are 
a credit risk) . where there is a large amount of 
statistical data, the identification of components from that 
data which are predictive of a particular feature of a 
sample from the system is a difficult task, generally 



because there is a large amount of data to process, the 
majority of which may not provide any indication or little 
indication of the features of a particular sample from which 
the data is taken. In addition, components that are 
identified using training sample are often ineffective at 
identifying feattires on test sanples data when the test 
sample data has a high degree of variability relative to the 
training sample data. This is often the case in situations 
when, for exan5)le, data is obtained from many different 
sources, as it is often impossible to control the conditions 
under which the data is collected from each individual 
source . 

An example of a type of system where these problems is 
particularly pertinent, is a biological system and the 
following description refers specifically to biological 
systems. The present invention is not limited to use with 
biological systems, however, and it has general application 
to any system. 

Recent advances in biotechnology have resulted in the 
development of biological methods for large scale screening 
of systems and analysis of samples. Such methods include, 
for example, DNA, rna or antibody microarray analysis, 
proteomics analysis, proteomics electrophoresis gel analysis 
and high throughput screening techniques. These types of 
methods often result in the generation of data that can have 
up to 30,000 or more components for each sample that is 
tested. 

It is obviously important to be able identify features of 
interest in samples from biological systems. For example, 
to classify groups such as "diseased" and "non-diseased" . 




- 4 - 

Many of these biological methods would be useful as 
diagnostic tools predicting features of a sample in the 
biological systems (e.g. for identifying diseases by 
screening tissues or body fluids, or as tools for 
5 determining, for exair^jle, the efficacy of pharmaceutical 
compounds) . 
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Use of biological methods such as biotechnology arrays in 
such applications to date has been limited owing to the 
large amount of data that is generated from these types of 
• methods, and the lack of efficient methods for screening the 
data for meaningful results. Consequently, analysis of 
biological data using prior art methods is time consuming, 
prone to false positive and negative results and requires 
large amounts of computer memory if a meaningful result is 
to be obtained from the data. This is problematic in large 
scale screening scenarios where rapid and accurate screening 
is required. 

There is therefore a need for an improved method, in 
particular for analysis of biological data, and, more 
generally, for an improved method of analysing data from any 
system in order to predict a feature of interest for a 
sample from the system. 

StlMMARY OP THE INVENTION 
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In a first aspect, the invention provides a method for 
identifying a subset of components of a system from data 
generated from the system from a plurality of samples from 
the system, the subset being capable of predicting a feature 
of a test sample, the method comprising the steps of; 



(b) 
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(a) generating a linear combination of components and 

component weights in which values for each component 
are introduced from data generated from a plurality of 
training samples, each training sample having a known 
feature ; 

defining a model for the probability distribution of a 
feature wherein the model is conditional on the linear 
combinat i on ; 

(c) constructing a prior distribution for the component 
weights of the linear combination comprising a 
hyperprior having a high probability density close to 
zero, wherein the hyperprior is not a Jeffreys 
hyperprior ; 

(d) combining the prior distribution and the model to 
generate a posterior distribution; 

(e) identifying a subset of components having component 
weights that maximise the posterior distribution. 

The method utilises training samples having a known feature 
in order to identify a subset of components v^ich can 
predict a feature for a training sample. Subsequently, 
knowledge of the subset of components can be used for tests, 
for example clinical tests, to predict a feature such as 
whether a tissue sample is malignant or benign, or what is 
the weight of a tumour, or provide an estimated time for 
survival of a patient having a particular condition. As used 
herein, the term "feature" refers to any response or 
identifiable trait or character that is associated with a 
sample. For example, a feature may be a particular time to 
an event for a particular sample, or the size or quantity of 
a sample, or the class or group into which a sample can be 
classified. 



The method of the present invention estimates the component 
weights utilising a Bayesian statistical method. 
Preferably, where there are a large amount of components 
generated from the system (which will usually be the case 
for the method of the present invention to be effective) the 
method preferably makes an apriori assumption that the 
majority of the components are unlikely to be components 
that will form part of the suJbset of components for 
predicting a feature. The assumption is therefore made that 
the majority of corrqoonent weights are likely to be zero. A 
model is constructed which, with this assumption in mind, 
sets the component weight so that the posterior probability 
of the weights given the observed data is maximised. 
Components having a weight below a pre-determined threshold 
(which will be the majority of them in accordance with the 
apriori assumption) are dispensed with. The process is 
iterated until the correct diagnostic components are 
identified. This method is quick, mainly because of the 
apriori assumption which results in rapid elimination of the 
majority of components. 



Preferably, the hyperprior has adjustable parameters which 
enable the prior probability density at zero to be varied. 

Most features of a system typically exhibit a probability 
distribution, and the probability distribution of a feature 
can be modelled using statistical models which are based on 
the data generated from the training samples. The method of 
the invention utilises statistical models which model the 
probability distribution for a feature of interest or a 
series of features of interest. Thus, for a feature of 
interest having a particular probability distribution, an 
appropriate model is defined that models that distribution. 
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The method may use any model that is conditional on the 
linear combination, and is preferably a mathematical 
equation in the form of a likelihood function that provides 
a probability distribution based on the data obtained from 
the training samples. Preferably, the likelihood function is 
based on a previously described model for describing some 
probability distribution. In one embodiment, the model is a 
likelihood function based on a model selected from the group 
consisting of a multinomial or binomial logistic regression, 
generalised linear model. Cox's proportional hazards model, 
accelerated failure model and parametric survival ' model . 

In one embodiment, the likelihood function is based on a 
multinomial or binomial logistic regression. The binomial 
or multinomial logistic regression preferably models a 
feature having a multinomial or binomial distribution. A 
binomial distribution is a statistical distribution having 
two possible classes or groups such as an on/off state. 
Examples of such groups include dead/alive, improved/not 
improved, depressed/not depressed. A multinomial 
distribution is a generalisation of the binomial 
distribution in which a plurality of classes or groups are 
possible for each of a plurality of samples, or in other 
words, a sample may be classified into one of a plurality of 
classes or groups. Thus, by defining a likelihood function 
based on a multinomial or binomial logistic regression, it 
is possible to identify subsets of components that are 
capable of classifying a sample into one of a plurality of 
pre-defined groups or classes. To do this, training samples 
are grouped into a plurality of sample groups (or "classes") 
based on a predetermined feature of the training samples in 
which the members of each sample group have a common feature 
and are assigned a common group identifier. A likelihood 
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function is formulated based on a multinomial or binomial 
logistic regression conditional on the linear combination 
(which incorporates the data generated from the grouped 
training samples) . The feature may be any desired 
classification by which the training san5>les are to be 
grouped. For example, the features for classifying tissue 
samples may be that the tissue is normal, malignant, benign, 
a leukemia cell, a healthy cell, that the training samples 
are obtained from the blood of patients having or not having 
a certain condition, or that the training samples are from a 
cell from one of several types of cancer as compared to a 
normal cell. 



Preferably, the likelihood function based on the logistic 
15 regression is of the form: 
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wherein 

xffig ±a a linear combination generated from input data from 
training sample i with component weights yff^; 

xf is the components for the i** Row of X and Pg is a set of 
con5)onent v/eights for sample class g-; and 

X is data from n training samples comprising p components. 
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In another embodiment, the likelihood function is based on 
an ordered categorical logistic regression. The ordered 
categorical logistic regression models a binomial or 
multinomial distribution in which the classes are in a 
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particular order (ordered classes such as for example, 
classes of increasing or decreasing disease severity) . By 
defining a likelihood function based on an ordered 
categorical logistic regression, it is possible to identify 
a subset of components that is capable of classifying a 
sample into a class wherein the class is one of a plurality 
of predefined ordered classes. By defining a series of 
group indentifiers in which each group identifier 
corresponds to a member of an ordered class, and grouping 
the training samples into one of the ordered classes based 
on predetermined features of the training samples, a 
likelihood function can be formulated based on a categorical 
ordered logistic regression which is conditional on the 
linear combination (which incorporates the data generated 
from the grouped training samples) . 

Preferably, the likelihood function based on the categorical 
ordered logistic regression is of the form: 



20 Wherein 
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Yi. is the probability that training sanple 1 belongs to a 
class with identifier less than or equal to ic (where the 
total of ordered classes is ic) .The r^ is defined further in 
the document. 

In another embodiment of . the present invention, the 
likelihood function is based on a generalised linear model. 
The generalised linear model preferably models a feature 
that is distributed as a regular exponential family of 
distributions. Examples of regular exponential family of 



distributions include normal distribution, guassian 
distribution, poisson distribution, gamma distribution and 
inverse gamma distribution. Thus, in another embodiment of 
the method of the invention, a siabset of components is 
identified that is capable of predicting a predefined 
characteristic of a sample which has a distribution 
belonging to a regular exponential family of distributions. 
In particular by defining a generalised linear model which 
models the characteristic to be predicted. Examples of a 
characteristic that may be predicted using a generalised 
linear model include any quantity of a sample that exhibits 
the specified distribution such as, for example, the weight, 
size or other dimensions or quantities of a sample. 

Preferably, the generalised linear model is of the form: 
L = log p(y I p, <p) =2{^^^^9^+c(x.q>) } 

where y « (y^ y„)T ^^d ai(<^) = 0 /wi with the Wi being a 

fixed set of known weights and <f> a single scale parameter. 
The other terms in this expression are defined later in this 
document . 



In another embodiment, the method of the present invention 
may be used to predict the time to an event for a sample by 
utilising a likelihood function based on a hazard model 
which preferably estimates the probability of a time to an 
event given that the event has not taken place at the time 
of obtaining the data. In one embodiment, the likelihood 
function is based on a model selected from the group 
consisting of Cox's proportional hazards model, parametric 
survival model and accelerated failure times model. Cox's 
proportional hazards model permits, the time to an event to 



be modelled on a set of components and component weights 
without making restrictive assumptions about time. The 
accelerated failure model is a general model for data 
consisting of. survival times in which the component 
measurements are assumed to act raultiplicatively on the 
time-scale, and so affect the rate at which an individual 
proceeds along the time axis. Thus, the accelerated 
survival model can be interpreted in terms of the speed of 
progression of, for example, disease. The parametric 
survival model is one in which the distribution function for 
the time to an event (eg survival time) is modelled by a 
known distribution or has a specified parametric 
fo2:mulation. Among the commonly used survival distributions 
are the Weibull, exponential and extreme value 
distributions . 



Preferably, a subset of components capable of predicting the 
time to an event for a sample is identified by defining a 
likelihood based on Cox's proportional standards model, a 
parametric survival model or an accelerated survival times 
model, which comprises measuring the time elapsed for a 
plurality of samples from the time the sample is obtained to 
the time of the event. 

Preferably, the likelihood function for predicting the time 
to an event is of the form: • 

N 

Log ( Partial) Likelihood = ^ g^ (/3.(p;X,y,c) 

1=1 

where f ^{pi,P2. ".Pp) and f --{<Pi.<P2.'",(Pq)^XQ the model 
parameters, y is a vector of observed times and c is an 
indicator vector which indicates whether a time is a true 
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survival time or a censored survival time. 



Preferably, the likelihood function based on Cox's 
proportional hazards model is of the form: 



/(£i^)=n 



exp (Z j/3) 
2 i^ifi ) 



where the observed times are be ordered in increasing 
magnitude denoted as i = (\iy\2y 'i(N))- %M)>\i)' and we denote 
by Zthe Nxp matrix that is the re -arrangement of the rows 
of X where the ordering of the rows of Z corresponds to the 
ordering induced by the ordering of £ . We also have 

P =(A'^2'--.Ai), Zj = the J** row of Z, and 

9ly=: {z. x=7,y+l.—,iV}= the risk set at the j"* ordered event 

time . 

Preferably the likelihood function based on the Parametric 
Survival model is of the form: 



N 

1=1 
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log 
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where Mi=A[yi:<p)exp[Xifi) and A denotes the integrated 
parametric hazard function. 



For any defined models, the component weights are typically 
estimated using a Bayesian statistical model (Kot25 and 
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Johnson, 1983) in which a posterior distribution of the 
component weights is formulated which combines the 
likelihood function and a prior distribution. The component 
weights are estimated by maximising the posterior 
distribution of the weights given the data generated for 
each training sample. Thus, the objective function to be 
maximised consists of the likelihood function based on a 
model for the featxire as discussed above and a prior 
distribution for the weights. 

Preferably, the prior distribution is of the form: 



Wherein v is a p x 1 vector of hyperparameters , and where 
p(fi \y') is i\^(0,diag{v2}) and p(v') is some hyperprior 
distribution for . This hyperprior distribution (which is 
preferably the same for all embodiments of the method) may 
be expressed using different notational conventions, and in 
the detailed description of the preferred embodiments (see 
below) , the following notational conventions are adopted 
merely for convenience for the particular preferred 
embodiment : 

As used herein, when the likelihood function for the 
probability distribution is based on a multinomial or 
binomial logistic regression, the notation for the prior 
distribution is: 
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Where ^(0[ ,...fil.,) and ^(tj ,..:,tU). 

ana p[pJ\tI) is iv(0,diag{T^}) and is some hyperprior 

distribution for t| . ' 

As used herein, when the likelihood function for the 
probability distribution is based on a categorical ordered 
logistic regression, the notation for the prior distribution 
is : 

X 1=1 

Where are component weights, i^(A|^f) isiV(o,vj^)and 

P(vf) some hyperprior distribution for Vf. 

As used herein, when the likelihood function for the 
distribution is based on a generalised linear model, the 
notation for the prior distribution is: 

wherein v is a p x 1 vector of hyperparameters , and where 
p(p is -A^(o,diag{i/2}) and p(v^) is some prior distribution 
for v^. 
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As used herein, when the likelihood function for the 
distribution is based on a hazard model, the notation for 
the prior distribution is: 

where p[fi*\r ) is iv(o.diagp}) and p{r ) some hyperprior 
distribution for r . 



P{^')= ! P{fi*\r )p{t )dT 

T 

The prior distribution comprises a hyperprior that ensures 
that zero weights are preferred whenever possible. 

In another embodiment, the hyperprior is an inverse gamma 
distribution in which each r,'=l/v,' has an independent gamma 
distribution . 

In a further embodiment, the hyperprior is a gamma 
distribution in which each , t, or rf (depending on the 
context) has an independent gamma distribution. 

As discussed above, the prior distribution and the 
likelihood- fvmction are combined to generate a posterior 
distribution. , The posterior distribution is preferably of 
the form: 



p{P9y\y)a L{y\fi<p)p(fi\v)p{v) 
Or 
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wherein is the likelihood function. 

The component weights in the posterior distribution are 
estimated in an iterative procedure such that the 
probability density of the posterior distribution is 
maximised. During the iterative procedure, component weights 
having a value less than a pre-determined threshold are 
eliminated, preferably by setting those component weights to 
zero. This results in elimination of the corresponding 
component . 

Preferably, the iterative procedxire is an EM algorithm. The 
EM algorithm produces a sequence of cornponent weight 
estimates that converge to give component weights that 
ma.ximise the probability density of the posterior 
distribution. The EM algorithm consists of two steps, known 
as the E or Expectation step and the M, or Maximisation 
step. In the E step, the expected value of the log- 
posterior function conditional on the linear combination is 
determined. In the M step, the expected log-posterior 
function is maximised to give updated component weight 
estimates that increase the posterior. The two steps are 
alternated until convergence of the E step and the M step is 
achieved, or in other words, until the expected value and 
the maximised value of the log-posterior function converge. 

It is envisaged that the method of the present invention may 
be applied to any system from which measurements can be 
obtained, and preferably systems from which very large 
amounts of data are generated. Examples of systems to which 
the method of the present invention may be applied include 
biological systems, chemical systems, agricultural systems, 
weather systems, financial systems including, for example. 
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credit risk assessment systems, insurance systems, marketing 
systems or company record systems, electronic systems, 
• physical systems, astrophysics systems and mechanical 
systems. For example, in a financial system, the Samples 
may be particular stock and the components .may be 
measurements made on any number of factors which may affect 
stock prices such as company profits, employee numbers, 
rainfall values in various cities, number of shareholders 
etc. 

The method of the present invention is particularly suitable 
for use in analysis of biological systems . The method of 
the present invention may be used to identify subsets of 
components for classifying samples from any biological 
system which produces measurable values for the components 
and in which the components can be uniquely labelled. in 
other words, the components are labelled or organised in a 
manner which allows data from one component to be 
distinguished from data from another component. For 
example, the components may be spatially organised in, for 
example, an array which allows data from each component to 
be distinguished from another by spatial position, or each 
component may have some unique identification associated 
with it such as an identification signal or tag. For 
example, the con^ionents may be bound to individual carriers, 
each carrier having a detectable identification signature 
such as quantum dots (see for example, Rosenthal,- 2001, 
Nature Biotech 19: 621-622; Han et al . (2001) Nature 
Biotechnology 19: 631-635), fluorescent markers (see for 
example, Pu et al, (1999) Nature Biotechnology 17 : 1109- 
1111) , bar-coded tags (see for example, Lockhart and trulson 
(2001) Nature Biotechnology 19: 1122-1123). 
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In a particularly preferred embodiment, the biological 

system is a biotechnology array. Examples of biotechnology 

arrays include oligonucleotide arrays, DNA arrays, DMA 

microarrays, RNA arrays, RNA microarrays, DNA microchips, 

RNA microchips, protein arrays, protein microchips, antibody 

arrays, chemical arrays, carbohydrate arrays, proteomics 

arrays, lipid arrays. In another embodiment, the biological 

system may be selected from the group including, for 

example, DNA or RNA electrophoresis gels, protein or 

proteomics electrophoresis gels, biomolecular interaction 

analysis such as Biacore analysis, amino acid analysis, 

ADMETox screening (see for example High -throughput ADMETox 

estimation: In Vitro and In Silico approaches (2002) , Ferenc 

Darvas and Gyorgy Dorman (Eds) , Biotechniques Press) , 

protein electrophoresis gels and proteomics electrophoresis 
gels. 

The components may be any measurable component of the 
system. in the case of a biological system, the components 
may be, for example, genes or portions thereof, DNA 
sequences, RNA sequences, peptides, proteins, carbohydrate 
molecules, lipids or mixtures thereof, physiological 
components, anatomical components, epidemiological 
components or chemical components. 

The training samples may be any data obtained from a system 
in which the feature of the sample is knovm. For example, 
training samples may be data generated from a sample applied 
to a biological system. For example, when the biological 
system is a DNA microarray, the training sample may be data 
obtained from the array following hybridisation of the array 
with RNA extracted from cells having a known feature, or 
cDNA synthesised from the RNA extracted from cells, or if 
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the biological system is a proteomics electrophoresis gel, 
the training sample may be generated from a protein or cell 
extract applied to the system. 

The inventors envisage that the method of the present 
invention may be used in one embodiment in re-evaluating or 
evaluating test data from subjects who have presented mixed 
results in response to a test treatment. Thus, in a second 
aspect, the present invention provides a method for 
identifying a subset of components of a subject which are 
capable of classifying the subject into one of a plurality 
of predefined groups wherein each group is defined by a 
response to a test treatment comprising the steps of: 



5 (a) 



(b) 
(c) 



exposing a plurality of subjects to the test treatment 
and grouping the subjects into response groups based 
on responses to the treatment; 
measuring components of the subjects; 

identifying a subset of components that is capable of 
classifying the subjects into response groups using a 
statistical analysis method. 

Preferably, the statistical analysis method is a method 
according to the first aspect of the invention. 

Once a subset of components has been identified, that subset 
can be used to classify subjects into groups such as those 
that are likely to respond to the test treatment and those 
that are not. In this manner, the method of the present 
invention permits treatments to be identified which may be 
effective for a fraction of the population, and permits 
identification of that fraction of the population that will 
be responsive to the test treatment. 
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In a third aspect, the present invention provides an 
apparatus for identifying a subset of components of a 
subject, the subset being capable of classifying the subject 
into one of a plurality of predefined response groups 
wherein each response group is formed by exposing a 
plurality of subjects to a test treatment and grouping the 
subjects into response groups based on the response to the 
treatment, the apparatus conprising; 

(a) means for receiving measured components of the subjects; 

(b) means for identifying a subset of components that is 
capable of classifying the subjects into response groups 
using a statistical analysis method. 

Preferably, the statistical analysis method is the method 
according to the first or second aspect. 

in a fourth aspect, the present invention provides a method 
20 for identifying a subset of components of a subject which 
are capable of classifying the subject as being responsive 
or non-responsive to treatment with a test compound 
comprising the steps of: 
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25 (a) 



(b) 
(c) 

30 



exposing a plurality of subjects to the compound and 

grouping the subjects into response groups based on 

each subjects response to the compound; 

measuring components of the subjects; 

identifying a subset of components that is capable of 

classifying the subjects into response groups using a 

statistical analysis method. 
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Preferably, the statistical analysis method is the method 
according to the first aspect. 



LO 



In a fifth aspect, the present invention provides an 
apparatus for identifying a subset of components of a 
siabject, the subset being capable of classifying the subject 
into one of a plurality of predefined response groups 
wherein each response group is formed by exposing a 
plurality of subjects to a compound and grouping the 
subjects into response groups based on the response to the 
compound, the apparatus comprising; 



(a) means for receiving measured components of the 
stibjects; 

L5 (b) means for identifying a subset of components that is 
capable of classifying the subjects into response 
groups using a statistical analysis method. 

Preferably, the statistical analysis method is the method 
according to the first or second aspect of the invention. 

The components that are measured in the second to fifth 
aspects of the -invention may be, for example, genes or small 
nucleotide polymorphisms (SNPs) , proteins, antibodies, 
carbohydrates, lipids or any other measurable component of 
the subject. 



In a particularly preferred embodiment, the compound is a 

pharmaceutical compound or a composition comprising a 

pharmaceutical compound and a pharmaceutical ly acceptable 
carrier . 
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The identification method of the present invention may be 
implemented by appropriate con5>uter software and hardware. 



In accordance with a sixth aspect, the present invention 
provides an apparatus for identifying a subset of components 
of a system from data generated from the system from a 
plurality of samples from the system, the subset being 
capable of predicting a feature of a test sample, the 
apparatus comprising; 

(a) means for generating a linear combination of components 
and component weights in which values for each 
component are introduced from data generated from a 
plurality of training samples, each training sample 
having a known feature; 

means for defining a model for the probability 
distribution of a feature wherein the model is 
conditional on the linear combination; 
means for constructing a prior distribution for the 
component weights of the linear combination con?)rising 
an adjustable hyperprior which allows the prior 
probability mass close to zero to be varied wherein the 
hyperprior is not a Jeffrey's hyperprior; 
means for combining the prior distribution and the 
model to generate a posterior distribution; 
(e) means for identifying a subset of components having 
component weights that maximize the posterior 
• distribution. 

The apparatus may comprise an appropriately programmed 
confuting device. 



(b) 



(c) 



(d) 
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In accordance with a seventh aspect, the present invention 
provides a computer program arranged, when loaded onto a 
computing apparatus, to control the computing apparatus to 
implement a method in accordance with the first aspect of 
the present invention. 

The computer program may implement any of the preferred 
algorithms and method steps of the first or second aspect of 
the present invention which are discussed above. 

in accordance with a eighth aspect of the present invention, 
there is provided a computer readable medium providing a 
computer program in accordance with the fourth aspect of the 
present invention, 

in accordance with a ninth aspect of the present invention, 
there is provided a method of testing a sample from a system 
to identify a feature of the sample, the method comprising 
the steps of testing for a subset of components which is 
diagnostic of the feature, the subset of components having 
bee^ determined by a method in accordance with the first or 
second aspect of the present invention. 

Preferably, the system is a biological system. 

in accordance with a tenth aspect of the present invention, 
there is provided an apparatus for testing a sample from a 
system to determine a feature of the sample, the apparatus 
including means for testing for components identified in 
accordance with the method of the first or second aspect of 
the present invention. 
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In accordance with an eleventh aspect, the present invention 
provides a computer program which when run on a computing 
device, is arranged to control the comptiting device, in a 
method of identifying conponents from a system which are 
capable of predicting a feature of a test sample from the 
system, and wherein a linear combination of components and 
component weights is generated from data generated from a 
plurality of training samples, each training sample having a 
known feature, and a posterior distribution is generated by 
combining a prior distribution for the component weights 
comprising an adjustable hyperprior which allows the 
probability mass close to zero to be varied wherein the 
hyperprior is not a Jeffrey's hyperprior, and a model that 
is conditional on the linear combination, to estimate 
component weights which maximise the posterior distribution. 

VWiere aspects of the present invention are implemented by 
way of a computing device, it will be appreciated that any 
appropriate computer hardware e.g. a PC or a mainframe or a 
networked computing infrastructure, may be used. 

in a twelfth aspect, the present invention provides a method 
for identifying a subset of components of a biological 
system, the subset being capable of predicting a feature of 
a test sample from the biological system, the method 
comprising the steps of: 

(a) generating a linear combination of con^jonents and 

component weights in which values for each component 
are determined from data generated from a plurality of 
training samples, each training sample having a known 
feature ; 



(b) 



(c) 



(d) 

10 (e) 



10 
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defining a model for the probability distribution of a 
feature wherein the model is conditional on the linear 
combination; 

constructing a prior distribution for the conponent 
weights of the linear co„4>ination comprising an 
adjustable hyperprior which allows the probability 
mass close to zero to be varied; 

confining the prior distribution and the model to 
generate a posterior distribution; 

identifying a subset of components having conponent 
weights that maximize the posterior distribution. 

DCTAII.ED DBSCRlPTIca, OF THE PHBPBEBED EMBODIMENT 

The present Invention Identifies preferably a relatively 
small number of components which can be used to identify 
Whether a particular training sanple has a particular 
feature The components are -diagnostic" of that feature, 
or enable discrimination between samples having a different 
feature. The number of components selected by the method can 
be controlled by the choice of parameters in the hyperprior 
.ssentxally. from all the data which is generated ZJ^^J' 
system, the method of the present Invention enables 
identification of a relatively small nuoO^er of components 
whxch can be used to test for a particular feature. Once 
those con^onents have been identified by this method, the 
conponents can- be used in future to assess new samples. «,e 
-thod Of the present invention utilises a statlstLal- 
method to eliminate con5,onents that are not required to 
correctly predict the feature. 

The inventors have found that cc^onent weights of a linear 
=ombxnat.on of components of data generated from the 
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training sanples can be estimated in such a way as to 
eliminate the components that are not required to correctly 
predict the feature of the training sample. The result is 
that a subset of components are identified which can 
correctly predict the feature of the training sample. The 
method Of the present invention thus permits identification 
from a large amount of data a relatively small and 
controllable number of components which are capable of 
correctly predicting a feature. 

The method of the present invention also has the advantage 
that at requires usage of less computer memory than prior 
art methods which use joint rather than marginal information 
on components. Accordingly, the method of the present 

invention can be tjerfoi-mftri •».=^^^t 

e perrormed rapidly on computers such as, for 

example, laptop machines i 

cnines. By using less memory, the method 

Of the present invention also allows the method to be 

performed more quickly than prior art methods which use 

Doint (rather than marginal) information on components for 

analysis of, for example, biological data. 

A first embodiment relating to a multiclass logistic 
regression model will now be described. 



25 A. Multi Class Logistic regression 



model 



30 



The method of this e,^i™ent utilises the training .anples 
in order to Identify a subset of conponents which can 
classify the training samples Into pre-defined groups 
subsequently, knowledge of the subset of components can be 
used for tests, for example clinical teats, to classify 
sables into groups such as disease classes. For exa^le, a 
subset of components of a DlOV microarray may be used to 
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group clinical sainples Into o]<r><^.ii,, _ , 

V ~ inco Clinically relevant classes such 

as, for exansjle, healthy or diseased. 

m this way. the present invention identifies preferably a 
s»«ll and controllable number of components which can be 
used to identify whether a particular training sample 
belongs to a particular group. The selected con^nents are 
"diagnostic" of that group, or enable discrimination between 
groups. Essentially, from all the data which is generated 
from the system, the method of the present invention enables 
xdentification of a small nun^r of co„ponentB which can be 
used to test for a particular group. Once those components 
have been identified by this method, the components can be 
used in future to classify new san^les into the groups. The 
n«thod of the present invention preferably utilises a 
statistical method to eliminate components that are not 
retired to correctly identify the group the san^le belongs 

The sanples are grouped into sample groups (or -classes-) 
based on a pre-determined classification. The classification 
n«y be any desired classification by which the training 
samples are to be grouped. Por example, classification 
»y be Whether the training seniles are from a leukemia cell 
or a healthy cell, or that the training samples are obtained 
from the blood of patients having or not having a certain 
condition, or that the training seniles are from a cell from 
one of several types of cancer as compared to a normal cell. 

in one embodiment, the input data is organised into an 
"xz-data matrix X={x„) with „ training samples and p 
components. Typically, p will be much greater than „ 
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in another enO^odiment, data matrix X may be replaced by an n 
X n kernel matrix K to obtain smooth functions of x as 
predictors instead of linear predictors. An example of the 
kernel matrix K is k.,=exp (-0 .5* (x.-x,)^ (x.-x,) /a^) where the 
5 subscript on x refers to a row number in the matrix X 

Ideally, subsets of the columns of K are selected which give 
sparse representations of these smooth functions. 

Associated with each sample class (group) may be a class 
10 label y,, where y,=k,k^{^,,,^G}. ^^^^^ indicates which of G 
sample classes a training sample belongs to. We write the 
»xl vector with elements y, as y. . Given the vector y we 
can define indicator variables 



15 ^^^j^ y.-g 
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otherwise (A1) 



in one embodiment, the component weights are estimated using 
a Bayesian statistical model (see Kotz and Johnson, 1983) 
Preferably, the weights are estimated by maximising the 
posterior distribution of the weights given the data 
generated from each training sample. This results in an 
objective function to be maximised consisting of two parts 
The first part a likelihood function and the second a prior 
distribution for the weights which ensures that zero weights 
are preferred whenever possible. In a preferred embodiment, 
the likelihood function is derived from a multiclass 
logxstic model. Preferably, the likelihood function is 
computed from the probabilities: 
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7 ^» .T„ \'S =^1,:.,G-1 (A2) 



V A=I J 



and 



t-^^t; X- (A3) 



5 wherein 



p.. Is the probability that the training ea^le with input 
data Ak will be in sample class s! 

^Ps is a linear oorabination generated from input data from 
10 training sample i with component weights fi,, 

4 is the components for the i» Row of ^ ar,/i R ■ 

«.ow or ^ and Pg is a set of 

conponent weights for saitple class g, 

^ically, as discussed above, the conponent weights are 
estimated in a manner which takes into account the apriori 
assu^tion that most of the con^nent weights are zero. 

in one embodiment, components weights fig m e<paation (A2) 
are estimated in a manner whereby most of the values are 
zero, yet the samples can still be accurately classified. 

in one embodiment, the component weights are estimated by 
™sing the posterior distribution of the weights given 
the data in the Bayesian model referred to above. 

i 5 

Preferably, the component weights are estimated by 
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(a) specifying a hierarchical prior for the component 
weights >S;,...,^^_, ; and 

(b) specifying a likelihood function for the input data; 

(c) determining the posterior distribution of the weights 
5 given the data using (A5) ; and 

(d) determining component weights which maximise the 
posterior distribution. 

In one embodiment, the hierarchical prior specified for the 
10 parameters fi^^ is of the form: 

P(A..., A.-,)= \^PW:)P{T'>^dr- (A4) 



where /?'=(A^-..^-0. r- = ....... ^_,), ;^r(o,diag{rJ}) 

15 p{tI) is a suitable prior. 



and 



In one embodiment, /.(Tj)=n/>(4) where is a prior 

wherein t^'=l/z-4 ^as an independent gamma distribution. 

20 In another- embodiment, ^(r^) is a prior wherein has an 
independent gamma distribution. 



In one embodiment, the likelihood function is of 

the form in ecjuation (8) and the posterior distribution of 
25 and r given y is 



(A5) 
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In one embodiment, the likelihood function has 
second derivative. 



a first and 



5 in one embodiment, the first derivative is determined from 
the following algorithm: 



(A6) 



10 wherein ^ = (e- i = l rt\ r,^-(^ • ^ \ 

^'-'")>Pg-{Pig,i-=h...,n) are vectors indicating 

membership of sample class g and probability of class ^ 
respectively. 

in one e„4,odimant, the second derivative is determined from 
15 the following algorithm: 



where 5^ is l if h equals g and zero otherwi 



(A7) 



20 



Equation A6 and equation A7 may be derived 



se. 



as follows: 



25 



(a) Using equations (Al) , (a2) and (A3), the likelihood 
function of the data can be written as : 



r 


• 


n 










V 5=1 



A=l J 



(A8) 
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(b) Taking logs of equation {A6) and using the fact that 

G 

for all i gives: 



10 



15 



^«g^= E -log l+X^^''^ (A9) 
(c) Differentiating equation (A8) with respect to Pg gives 
dfig ~ y^e-Pgh S = l G-l (A10) 

Whereby =(e,„i =l.n) , = ^re vectors indicating 

mendDership of sample class flr and probability of class g 
respectively. 

(d) The second derivative of equation (9) has elements 



where 



20 A. =J^* ^^S' 

otherwise 
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Component weights which maximise the posterior distribution 
of the likelihood function may be specified using an EM 
algorithm comprising an E step and an M step. 

in conducting the EM algorithm, the E step preferably 
comprises the step computing a term of the form 
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C-l n 



where 4 =^{l/< I 4>-^ ^.K^.^.....^,/ and 4=l/i, = 0 if ^ = 0. 

Preferably, equation (lia) is computed by calculating the 
conditional expected value of t,^=i/,^^^ ,i,en is 
N{0,Ti) and ^(r^) has a specified prior distribution. 

Explicit formulae for the conditional expectation will be 
presented later. 

Typically, the EM algorithm comprises the steps: 

(a) performing an E step by calculating the 

conditional expected value of the posterior 
distribution of component weights using the 
function: 

Q-Q(r\y.r) ='»gi-igr; <*.i.^K(r.)}-V. (au) 

Where =^P,r^ in equation (8), d(f,)= P^d^ , and 

is defined as in equation (iia) evaluated at 
^g=Pgfg' Here is a matrix of zeroes and ones 
derived from the identity matrix such that P/^ff^ 
selects non-zero elements of which are denotld 
by y^. 



(b) 
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performing an M step by applying an iterative 
procedure to maximise Q as a function of y 
whereby : 



(A13) 



where or' is a step length such that O^a'^l; and 

Y = (Yg. g^l, .G-1) . 

Equation (A12) may be derived as follows: 

calculate the conditional expected value of (A5) given the 
observed data y and a set of parameter estimates /? . 

consider the case when components of (and ^ ) are set to 
zero i.e for ^ = 1,..., G-1, ,9, =P^y^ and ^,=P^f,. 

ignoring terms not involving /and using (A4) , (AS), (A9) we 
get: 



Q = logL 



(A14) 
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where in (A8). ^^iy^^^pr^^ .^.ere 4 is defined as 

in equation (Alia) evaluated at B =Pv 

S Note that the conditional expectation can be evaluated from 
fxrst principles given (A4) . SoMe e:<pllclt egressions are 
given later. 



The iterative procedure may be derived as follows, 
TO Obtain the derivatives retired In <ll, . first note that 
fro. <A8,, ,As, and (Aao, writing ^(«.{rf.(^.U=l,....G-l, . 
get 



we 



15 



-^C-l(%-I-/><7-l) 



(A15) 



and 



20 



9y 



1 11-^1 ... yl, 



(A16) 



- 36 



■ft, 

^(r)=W^),^=i,...,G-i) 



where ^gh^V 

otherwise 



and 



10 



■^s -^g^^'g^ l»...G-\ (A17) 

in a preferred embodiment, the iterative procedure may be 
simplified by using only the block diagonals of equation 
(A16) in equation (A13) . For ^=1.....G-1. this gives: 

Rearranging equation (A18) leads to 



where 



Writing p{g) for the number of columns of . {a19) requires 

the inversion of a p{g) xp(^) matrix which may be quite 

large. This can be reduced to an n^n matrix for p{jg)>n by 
noting that: 
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(i;Xy,+/r'=/-i';(iii'/+A-)-'ii 

where Z^=A^y,, Preferably, (A19) is used when p{g)>n and 
(A19) with (A20) substituted into equation (A19) is used 
5 when p{g)<n. 

Note that when has a Jeffreys prior we have 

in one embodiment, t| =1/^ has an independent ^a^nma 
10 distribution with scale parameter b>0 and shape parameter 
k>0 so that the density of t| is 

Y(t^,bjc)=b-'(t.»''-'exp(4/b)/r(k) 

omitting subscripts to sirt^lify the notation, it can be 
shown that 

E{tMP}= (2k+l)/(2^ + p2) (A21) 

as follows : 
Define 

I(P,b4c)= /(t^)" t exp(.0.5p42)y(t^bJc)dt' 

0 

20 then 

I(p,b,kH)'^^{r(p+fc+0.5)r(k)}(l4<).5bp2)-<i'*^ 

Proof 

Let s^p^n then 

CO 

I(p.b4c)=b''«-^ /(t^/b)'»«'^ exp(-st2)y(t2,b4c)dt2 

0 

25 Now using the substitution u = tVb we get 
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I(p.b4c)=b'^^ /(u)"^^ e;q)(.sub>y(u,l Jk)du 

0 

Now let s'=bs and substitute the expression for y(u,1Jc) . This 

gives 

I(p,b,k)=b'^* J exp«s'+l)u)u'^^'du / r(k) 

0 

5 Looking up a table of Laplace transforms, eg Abramowitz and 
Stegun, then gives the result. 

The conditional eaqjectation follows from 

10 E{t^|p>=l(l,b,k)/r(0,b,k) 

= (2kH-l)/(2/b + p2) 

-AS k tends to zero and b tends to infinity we get the 
equivalent result using Jeffreys prior. For example, for 
•k=0.005 and b=2*lO^ 

E{t* IP}=(1.01)/(10^+P^) 

Hence we can get arbitrarily close to the Jeffreys prior 
with this proper prior. 

The algorithm for this model has 

where the expectation is calculated as above. 

In another embodiment, r,^ has an independent ^amma 

distribution with scale parameter b>0 and shape parameter . 
k>0. It can be shown that 
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CO 

/u''-^'^-'exp(.(y.,/u+u))du 

_0 

b Ju'-'^'-'expHY^/u +u))du 



- /2 1 ^3a.v(.2jy^) 

"Vb|P,|K,,,(2V^) <^22) 
_ 1 (2^/y~)K3^.,(2,/^) 

where y^=p^'/2b and K denotes a modified Beasel function. 
For k=l in equation (A22) 

For K=0.5 in equation (A22) 

E{<|P^}=>^(1/1PJ)(K,(2^)/K„(2^)} 
or equivalently 

E{<lft}=(l/lPJ^){2V?;-K,(2^)/Ko(2^)} 

Proof of (A.l) 

From the definition of the conditional e^ectation, writing 
7=P^/2b, we get 



Jz-V-'e3q)(-jrr-^)b-'(r-%)''-'exp(r-%)dz^ 

^ir' 

/r■'exp(-yr-')b■'(r-^^)'-•exp(T-^^)dT^ 

0 

Rearranging, simplifying and making the substitution u=r^/b 
gives the first equation in (A22) . 

The integrals in (22) can be evaluated by using the result 
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Where K denotes a modified Bessel function, see 
Watson (1966) . 

Bxanples of .en,bers of this class are k=i in which case 

which corresponds to the prior used In t-ho t 

TiKoi,- . / P-^-Lor used in the Lasso technique, 

Txbsharanx(i996). see also Figueiredo(2001). 
The case k=0.5 gives 

>/2^(l/l^^|)(K.(2^)/K„(2^)J 

or equivalently 

E{<IP,}=(l/|P,f)(27^K,(2Vir)/K„(2^)} 
where Ko and are modified Bessel f„no^-r 

and Stegun(l970) Polvnom' 1 f-^<=txons, see Abramowitz 

9 C1970} . Polynomial approximations for evaluatino 
these Bessel fiinctions can k« ^ ^ . <*J-uacing 
cti-^ ""ccions can be found in Abramowitz and 

ITI • - ^ ae™o„3.«te the 

con^ect^on „.th eh. x^sso 3^ ^^^^^ 



tLl".' tho.e Skilled in ehe ^ that as k 

oer treys iitproper prior. 

prior and '''^ ^-^so 

prior and the specification using Jeffreys hyper prior. 

The hype^,,^,^^^ ^ ^ ^^^^^^ ^ 

-ro for fxxea b the nu„u,er of components selected can be 
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decreased and conversely as k tends to 1 the number of 
selected components can be increased. 

in a preferred embodiment, the EM algorithm is performed as 
5 follows : 



15 



1. Set n=0 ,P,=/ and choose an initial value for y«>. Choose a 
value for b and k in equation (A22) . For example b=le7 and 
k=0 gives the Jeffreys prior model to a good degree of 
approximation. This is done by ridge regression of 
logCpWPia ) on where p,, is chosen to be near one for 
Observations in group g and a small quantity >o otherwise 
- subject to the constraint of all probabilities summing 
to one. 

2. Do the E step .i.e evaluate Q = Q(r\y,r). Note that this 
also depends on the values of k and b. 

3. Set t«0. For ^ = 1,...,G-1 calculate: 

^) /;=yr-r; using (A19) with (A20) substituted into 
20 (A19) when p(ff)^. 

(b) writing ^' =(^;,^ = l,.„,a-i) oo a line search to find the 
value of a' in r'*'=r'^oCSJ which maximises (or simply 
increases) (12) as a function of a' . 
c) set 7'*'=r' and t=t+l 
25 Repeat steps (a) and (b) until convergence. 

This produces ^-'say which maximises the current Q function 
as a function of y . 
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For g = U.G-l determine -S, =|y [^^max|/r|} 

Where e^l . say lo-. Befine so that /?^=0 for /e^^and 

This step eliminates variables with small coefficients from 
the model. 



4. Set n=:n+l and go to 2 



until convergence. 



A sec:ond embodiment relating to an ordered oat logistic 
regression will now be described. 

B. Ordered categories model 

The method of this embodiment may utilise the training 
san^les in order to identify a subset of components which 
can be used to determine whether a test sa^le belongs to a 
particular class. For example, to identify genes for 
assessing a tissue biopsy sample using mlcroarray analysis, 
n-croarray data from a series of samples from tissue that 
has been previously ordered into classes of increasing or 
decreasing disease severity such as normal tissue, benign 
tissue, localised tumour and metastasised tumour tissue are 
used as training samples to identify a subset of components 
which is capable of indicating the severity of disease 
associated with the training samples. Itoe subset of 
components can then be subsequently used to determine 
whether previously unclassified test samples can be 
classified as normal, benign, localised tumour or 
metastasised tumour. Thus, the subset of components is 
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diagnostic of whether a test sa^i. belongs to a particular 
Class Within an ordered set of classes. It will be apparent 
that once the subset of components have been identified 
only the subset of components need be tested in future 
5 diagnostic procedures to determine to what ordered class a 
sample belongs. 

The method of the invention is particularly suited for the 
analysis of very large amounts of data. Typically, large 
data sets obtained from test samples Is highly variable and 
often differs significantly from that obtained from the 
training samples. The «thod of the present invention is 
able to identify subsets of components from a very large 
amount of data generated from training samples, and the 
subset of components identified by the method can then be 
used to classifying test sanples even when the data 
generated from the test sample Is highly variable compared 
to the data generated from training samples belonging to the 
same class. Thus, the method of the invention is able to 
identify a subset of components that are more likely to 
Classify a sanple correctly even when the data is of poor 
^llty and/or there is high variability between samples of 
tHe same ordered class. 

^e components are "predictive" for that particular ordered 
class. Essentially, from all the data which is generated 
from the system, the method of the present invention enables 
xdantxfication of a relatively small nu.*er of components 
Which can be used to classify the training data. Once those 
components have been Identified by this method, the 
components can be used in future to classify test seniles. 
The method of the present invention preferably utilises a 
statistical method to eliminate components that are not 
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required to correctly classify the sample into a class that 
is a member of an ordered class. 

in the following there are N samples, and vectors such as y, 
2 and n have components y^, and for i = i, Vector 
multiplication and division is defined conponentwise and 
diag{ • } denotes a diagonal matrix whose diagonals are 
equal to the argument. We also use | | - | | to denote 
Euclidean norm. 
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Preferably, there are N observations /, where takes 

integer values 1 g. The values denote classes which are 

ordered in some way such as for example severity of disease 
Associated with each observation there is a set of 
covariates (variables, e.g gene expression values) arranged 
xnto a matrix X* with n row and p columns wherein n is the 
samples and p the components. The notation denotes the 
i"' row of X*. individual (sample) i has probabilities of 
belonging to class k given by ^«=;r,(A:;) . 

Define cumulative probabilities 

* 

>'a=5]^/* ' k = 1,...,G 

g-\ 

Note that is just the probability that observation i 
belongs to a class with index less than or equal to k. Let C 
be a N by p matrix with elements c, given by 

25 C. = / ^» if observation i in class j 

U (0. otherwise 

and let R be an n by P matrix with elements r„ given by 



20 



^=1 



These are the cumulative sums of the columns of C within 



rows . 
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For independent observations (samples) the likelihood of the 
data may be written as 



5 and the log likelihood L may be written as 

The continuation ratio ^odel „»y be adopted here as follows: 

for k = 2,..,G , see McCullagh and Nelder(l989) and 
10 Mccullagh(1980) and the discussion therein. Note that 

The likelihood is equivalent" to a logistic regression 
15 likelihood With response vector ^and covariate matrix X 

y =vec{R} 

where/,., is the G-1 by G-l identity matrix and 1^. is a G-l 
by 1 vector of ones. 

Here vec{ .} takes the matrix and forms a vector row by row. 

1 0 

Typically, as discussed above, the component weights are 
estimated in a manner which takes into account the a priori 
assumption that most of the component weights are zero 
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Following Figueiredo(2001), in order to eliminate redundant 
variables (covariates) , a prior is specified for the 
parameters ^* by introducing a p x 1 vector of 
hyperparameters . 

5 

Preferably, the prior specified for the component weights is 
of the form 



(Bl) 

10 Where i3 Jv(o,diag{.^}) and ^(v^) i. . 3uitably chosen 

hyperprior. For ex«,pie, p{v')a^p(vn is a suitable for™ of 
Jeffreys prior. 

In another embodiment, ^(v,^) is a prior wherein t/=l/v/ has 
15 an independent gamma distribution. 

In another embodiment n(v^\ ^ 

uimenc, p^y^ ) is a prior wherein vf has .an 

independent gamma distribution. 
20 The elements of theta have a non informative prior. 

• writing L[y\fi' o) for the likelihood function, in a Bayesian 

framework the posterior distribution ot e and . given v 

is 



25 



(2) 



0 



By treating v as a vector of missing data, an iterative 
algorithm such as an EM algorithm (Dempster et al, 1977) can 
be used to maximise (2) to produce maximum a posteriori 
estimates of p and 9. The prior above is such that the 
maximum a posteriori estimates will tend to be sparse i.e. 
if a large number of parameters are redundant, many 
components of p* will be zero. 

Preferably pT^^e\p*^) in the following: 

For the ordered categories model above it can be shown that 

f=^'^-*^) . (11) 

^< apr>= - x*diag{//(i-//)>x (12) 

Where //, = exp(x.T^)/(i+exp(x,V))and fi^ ==ie,,...,e^,fi^) . 

The iterative procedure for maximising the posterior 
distribution of the . components and component weights is an 
EM algorithm, such as, for example, that described i 
Dempster et al, 1977. Preferably, the EM algorithm 
performed as follows: 



m 
im Is 



1. Chose a hyperprior and values b and k for its parameters. 

set n=0. So = {l,2 p > , ^co, ^ ^ -^^^^^ 

regularisation parameter ;c at a value much greater than l, 
say 100. This corresponds to adding l//c' to the first G-l 
diagonal elements of the second derivative matrix in the M 
step below. 

If p ^ N compute initial values p* by 

P*=(X«XWX-g(y4g ^32, 



10 



- 48 - 

and if p > N compute initial values p* by 

Where the ridge parameter X satisfies 0 < A. ^ 1 and C is 
small and chosen so that the link function ^(z) = log(z/0-z)) 
well defined at z « 
2. Define 

p(n)^ I P,*, ieS„ 
0, otherwise 

and let P. be a matrix of zeroes and ones such that the 
nonzero elements. y<°> of p'-> satisfy 

Define = (vi'^„ » = , such that 



0, otherwise 



and let Wy=P^w^ 



15 3 . Perform the E step by calculating 

Q(P I pC^ q,<-0) = E{ logp(p. 9 . V I y) I y, p(»), 

= L(y I P, 9*">)-0.5 (II 0*w^)/d<">f ) ^^5) 
Where L is the log likelihood function of y and 
'^.'^O/^^l^rS^.^C^.^,,...,^^/ and for convenience we 
20 define 4 = 1/4 = 0 if^^O. . using P =P„r and p- =P„y. as) 
can be written as 



Q(Yly^9^">)=L(y|P„y,<p(»)).o.50|(yXyd(/^^^^^^ ) (B4) 
with d(r^''^)^prdM evaluated at =P,;.(-) . 
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4. Do the M step. This can be done with Newton Raphson 
iterations as follows. Set yo = y<°> and for r=0,l,2,.» y„i 
= Yr + a. 6, where is chosen by a line search algorithm to 
ensure Q(y^, | yW, ,p(-)) > Q(y | ^W, ^W) . 

5 For p ^ N use 



5 = 



'r 



where 



A(d-(y<->))[Y:V;'Y„4.I]-'(Y„^._!p.) ^^3, 



J*^, otherwise 
Y>A(d*(y(-)))P„Tx^ 

^ V;'=diag{//,(l-/i,)} 

Zr=(y-ii,) 

and = exp( XP„y,)/(H-exp( XP„y,)) . 
For p > N use 

8,=A(d-(r«))t, -Y:(Y.YJw,)- Y.](Y.\-:^) (B6) 

d (y^ ') 

With Vr and Zr defined as before. 

Let y* be the value of y, when some convergence criterion is 
satisfied e.g 

I I Yr - 7r+i| I < E (for example lO"^ ) . 

5. Define p* = P„y* , S^.={i^G: I P. | >m^(|PJ*s. ) } where is a 
small constant, say ie-5. Set n=n+i 

6. Check convergence. If J | y* . y(n) || < ^^^^^ 
suitably small then stop, else go to step 2 above. 



Recoyering the probabilities 




10 



15 
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once „e have obtained estin^tes of the parameters p are 
Obtained, calculate 



for 1=1, .„,N and k = 2,«.,g. 



Preferably, to obtain the probabilities „e use the recursion 



the tact that the probabilities sum to one, for 1 . 
In one embodiment the covariate matrix x 

With rows x.^ can be replaced b. a .atrix K with i,^^ element 

= - ) for some kernel function k. This 

tnatrax can also be augmented with . . 

examoi^ V , "S^nented wxth a vector of ones. Some 

example kernels are given in Tabl^ i k 

al (1999) ^ Evgeniou et 
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Table 1: Examples of kernel functions 
in Table l the last two kernels are preferably one 
^ensional i.e. for the case when X has only one coW. 

5 ZllTT ■^^^^'''^^ '^^^^^^ ot these 

5 ^ernel functions. The definition of B... can be found in De 
Boor (1978) . Use of a kernel f nr,o^ • 

whi^v, function results in mean values 

which are smooth (as ODt)oa*»rl *- 

^ °PP°sed to transforms of linear) 
functxons of the covariatea v e i. ^ 

ariaces X. Such models may give a 

substantially better fit to the data. 

A third embodiment relating to generali^^H i • 

win I. ^ generalised linear models 

will now be described. 

C. Generalised Linear Models 

15 

oo^rv. . interest. For example, a subset of 

components of a BKA .icroarray r^y be used to predict a 
clinically relevant characteristic . k P^e^ict a 

blood glucose level ^"^''^""^^ tor example, a 

glucose level, a white blood cell count, the size of a 
25 tumour, tumour growth rate or survival time. 

relatively small number of components ^ich can be used to 
■ predict a characteristic for a particular sanple. 
>0 selected components are -predictive" for that 

l"h"eT"'°- ^""""""^'"^ °' hyperparameters in 

the hyper prior the algorithm can be made to select subsets 
of varying sizes. Essentially from an / 

laiiy, rrom all the data which is 
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generated from the system, the method of the present 
xnventlon enables identification of a small number of 
components which can be used to predict a particular 
Characteristic. Once those con^nents have been identified 
S by this method, the components can be used In future to 
predict the characteristic for new sa»ples. The method of 
the present invention preferably utilises a statistical 
method to eliminate components that are not required to 
correctly predict the characteristic for the sample. 

The inventors have found that component weights of a linear 
combination of components of data generated from the 
trarnlng sa„i,les can be estimated in such a way as to 
eliminate the components that are not retired to predict a 

subset of components are identified which can correctly 
predict the characteristic for samples in the training set. 
The method of the present invention thus permits 

20 s^rr"T'°" ' "^''^ °' * relatively 

small number of components which are capable of correctly 
predrctlng a characteristic for a training sample, for 
example, a quantity of interest. 



25 



30 



on! el ! ""^ characteristic of interest. In 

in antht"^"' characteristic is a quantity or measure, 

m another embodiment, they may be the index number of a 
group, where the samples are grouped into two sample groups 

"^-^ °" " Pre-determlned classlfi:atl!n.T:e 
Classification ^y be any desired classification by which 
the training samples are to be grouped. For example, the 
Classification may be whether the training sables are from 
a leukemia cell or a healthy cell, or that the training 
samples are obtained from the blood of patients having or 
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are fron a cell from one of several types of cancer as 
con^red to a normal cell, m another embodiment the 

that par .cular patients have survived for at least a givL 
numher of days. Xn other en^diments the .entity may IZy 

ZZZZT" =--"«-tic of. the sa^le which is 

capable of measurement, foi. example blood pressure. 

10 In one embodiment, the data may be a ^entity y, . „here 

..{......AT}, we write the „xl vector with elements as y. „e 

define a p x 1 parameter vector p of cooponent weights <many 
of which are expected to be zero) .^^ 

zero) , and a q x 1 vector of 
parameters q> (not expected to be zero> 
IS zero ' ■ '^^'^ 1 could be 

zero (x.e. the set of parameters nbt expected to h. 
be empty) . expected to be zero may 

xn one embodiment, the input data is organised into an 
»xpdata matrix ^ = , ,,,, ^^^^^ ^^^^^^ ^ 

20. components, ^ically, ^ ^ ^^^^^ ^ 

in^other embodiment, data matrix ^ may be replaced by an n 
X n kernel matrix K to obtain * an n 

. ^° ototain smooth functions of x as 

predictors instead of linear predictors Ar. 
5c 1, , i'-t.eaicnors . An example of the 

::ripT:rx'^.:L:rr'^°-=*'^'-^^'''---'-' 
Ideally. .ubsetroTt^ ::L:r:h- h give 

sparse representations of these smooth functions. 

30 Typically, as discussed abov^ t-v.^ « 

estimat^H • component weights are 

estimated xn a manner which tak^« 

takes into accovmt the aoriorl 
assumption that most of i-v,^ apriori 
ac most of the component weights are zero. 



- 54 - 



5 



In one embodiment, the prior specified for the component 
weights is of the form: 

(CI) 

wherein v is a p x 1 vector of hyperparameters , and where 
p(fi is JV(0,diagp}) and p{v') is some hyperprior 
distribution for v^. 



10 A suitable form of hyperprior is, p(v')aflp(yf) . Jeffreys 
in another embodiment, the hyperprior p{v') is such that 



each tj'-l/vj^has an independent gamma distribution. 

In another embodiment, the hyperprior p{v') is such that 
15 each v/ has an independent gamma distribution. 

Preferably, an uninformative prior for <p is specified. 

The likelihood function is defined from a model for the 
20 distribution of the data. Preferably, in general, the 

likelihood function is any suitable likelihood function. For 
example, the likelihood function L(y\fi^) may be, but not 
restricted to, of the form appropriate for a generalised 
linear model (glm) , such as for example, that described by 
25 Nelder and Wedderbum (1972) . Preferably, in this case, the 
likelihood function is of the form: 
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(C2) 



10 



15 



20 



where y . (y,,.., ^ ^ ^^^^ ^ 

faxed set of taown weights « a single scale parameter. 

: ^-^«^ly. the livelihood function is specified as (ollows. 

We nave 

E{y,}=b'(e,.) 

Var{y>=b"(e,)a,(<p) = xfa,(q,) (C3) 

Each obse^ation has a set of covarlates x. and a linear 

Z T: ^ -^-^--^iP between the .ean of 

the X Observation and its linear predictor is given by the 
link fimction Vi = q(m ) - or k/ x . . 

, . ~ i>'<^i) ). The inverse of the 

link IS denoted by h, i.e 

Mi = h'(ei) = h(77i). 

in addition to the scale parameter, a generalised linear 
model may be specified by four components: 

• the likelihood or (scaled) deviance function, 

• the link function 

• the derivative of the link function 

• the variance function. 

Some common examples of aenerai -J i • 

. ^, ^ generalised linear models are given 

m the table below y ven 




Gaussian 



Binomial 



Derivative 
of link 
function 




Variance 
function 



Scale 
parame 

ter 
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Poisson 


log (n) 


1/ fi 




No 


Gamma 


1/ fi ' 


-1/ ■ 




Yes 


Inverse 
Gaussian 


' 1/ (X^ 


-2/ 




Yes 



in another e„*odi™ent, a ^asl likelihood model is specified 
Wherein only the link fu„etion and variance (unction axe 
defied in son« instances, such specification results in 
the .nodels in the tahle above. 1„ other instances, no ' 
distribution is specified. 

Xn one en4»di.ent, the posterior distribution of , and 
given y is estimated using.- 

wherein L{yjfi^) is the likelihood function. 



in one e„0.od..ent, v be treated as a vector of missing 
^ata and an xterative procedure used to n^xi^se elation 
(C4) to produce maximum a posteriori estimates of p The 
prior Of e<^tion (CI, is such that the maximum a posteriori 
est.m»tes will tend to be sparse i.e. if a large nu„*er of 
parameters are redundant, many components of p win be zero. 

A3 stated above, the component weights which maximise the 
posterior distribution may be determined using an iterative 
procedure. Preferable, the iterative procedure for 
maximising the posterior distribution of the components and 
component weights is an KM algorithm comprising an E step 
and an „ step, such as, for example, that described in 
Dempster et al, 1977. 



- 57 - 

in conducting the EM algorithm, the E step preferably 
comprises the step of computing terms of the form 



(C4a) 



where 4=<(A) = £{1/.;| fi,}-' and for convenience v« define 

d, = l/d,^Oifp, = 0. in the following we write d=d(^)=(d„d,,.,.Jjr , 

In a similar manner we define for example dC/S'-^) and 

d(y<''))=.lfd(Py^) where >ffW=i>y(») ^^d P i« ^k,. • ^ ^ 

H j-„r ana ^ xs obtained from the p 

by p identity matrix by omitting columns j for which ^j-)=0. 

Preferably, equation (C4a) is computed by calculating the 
conditional expected value of t,^^!^ when p[p\^^ ,s i^(o.vf) 
and p{v^) has a specified prior distribution. Specific 
examples and formulae will be given later. 

in a general embodiment, appropriate for any suitable 
ixkelihood function, the EM algorithm comprises the steps- 
(a) Selecting a hyperprior and values for its 

parameters. Initialising the algorithm by 
setting n=0, Sp » {1,2,.., p } , initialise (p«» 
. and applying a value for 8, such as for 
example e = lo"^; 
(b) Defining 

^ 0, otherwise ^^^^ 
and let Pn be a matrix of zeroes and ones such 
that the nonzero elements y<°' of p<°) satisfy 
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performing an estimation (E) step by calculating 
the conditional expected value of the posterior 
distribution of component weights using the 
function: 



Q(P I P<->, <p<">) = E{ logpO. (p , V I y) I y, pC), cp<->} 

=-L(y|P.<P^"^)-0.5>ff^A(£/(yS<">))-2;ff ^^^^ 

where L is the log likelihood function of y. 
Using p =p„y and d(y<->) as def ined in (C4a) , (C6) 
can be written as 

Q(ylY^cp«">)=L(y|P„y,q,(->)^.5;.rA(rf(/->)r^ ^^^^ 
performing a maximisation (M) step by applying 
an iterative procedure to maximise Q as a 
function of y whereby yo = y(">and for r=0,l,2, 
yr*i = Yr + Or 8r and where ar is chosen by a 
line search algorithm to ensure 
Q(y^.l/"\9^">) > Q(yjY(?>,q,<->)^ 

5,=A(d(Y<->))[.A(d(y<->))|LA(d(y(«>))H-ij-^ (C8) 

where : 

d(Y<"') = P,V(i»y"))as in (C4a) ; and 
for P,=P„y,. 
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(e) Let y* be the value of yr when some convergence 
criterion is satisfied, for example, | | yr - 
yr+1 1 I < 8 (for example 10"^ ) ; 

(f) Defining p* = P,y* , s^.={i: I >inax(|PJ*E, ) } 
where is a small constant, for example le-5. 



(9) 



10 



(h) 



Set n=n+l and choose q)*»*=^> = <p <°>+Kn( 9* - <p<°)) 
where 9* satisfies ^L(y|Py.«p) = 0 and k« is a 
damping factor such that 0< k„ ^ i; and 

Check convergence. If | | y* - y^^ | | < ^here 
is suitably small then stop, else go to step (b) 
above. 



In another embodiment, t? =1/^^ has an independent gamma 
distribution with scale parameter b>0 and shape parameter 
k>0 so that the density of t? is 

y(tf,b,k)=b->(t?/b)'-•exp(-t.?^)/^Ck) 
It can be shown that 

E{t*|p>= (2k+l)/(2/b + p2) 

as follows : 
Define 

oa 

J(p,hM)=f(t'y t exp(-0.5p^t*>y(t',bjc)dt2 

0 

then 

i(p,b,k)-bp^-^{r(p+k+o.5)/r(k)}(i-w 

Proof 
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Let s=fi^f2 then 



I(p,bjc)=b'^^ J(t'^)'-o^ exp(-st^)y(t^b4c)dt^ 
o 

Now using the substitution u = t^ /b we get 

I(p.b.k)=b>^' /(ur^ exp(.sub)y(u,l.k)du 

0 

5 NOW let s'=ba and substitute the expression (or r(„,U) . This 



gxves 



3 



«0 

I(p,b40=b'^-* J expHs'+l)u)u'^'^-^'du / r(k) 

0 

Looking up a table of Laplace transforms, eg.Abramowltz and 
Stegun, then gives the result. 

The conditional expectation follows from 

E{t^|P}=I(l,bJc)/[(0,b,k) 
= (2k+l)/(2/b + p2) 

AS k tends to zero and b tends to infinity we get the 
equivalent result using Jeffreys prior. For example, for 
k=0.005 and b=2*10^ 

E{t'l^}=(l.ol)/(lo-s^.p2) 

Hence we can get arbitrarily close to the algorithm with a 
Jeffreys hyperprior with this proper prior. 

in another embodiment, has an independent ^amma 
distr-ibutiox, With scale parameter b>0 and shape parameter 
K>0. It can be shown that 
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CO 

/u''-^-'exp(-(;i,/u+u))du 
b /u''-*^-'exp(-(;i,/u+u))du 

0 

where ^=p,^/2b and K denotes a modified Bessel function, 
which can be shown as follows; 
For k=l in equation (c9) 

E{v,-'|P,}=V27b(l/|p,|) 

For K=0.5 in equation (C9) 

E{vr^|ft>= >/2^(l/IPJ){K,(2V^)/Ko(2V^)} 
or equivalent ly 

-2 I 



Proof 



E{v,- I3.}=(1/IP,P){2V^K.(2V^)/k:„(2V^)) 



From^the definition of the conditional expectation, writing 
/2b , we get 



/v..-^i•'exp(-V.-')b-'(v,-='/b)'^-'exp(Vi■^^)dv.^ 



/vi■'exp(-^VJ-')b-•(v.-'^)''-•exp(v;='^)dv.' 
Rearranging, simplifying and making the substitution u=v.?/b 



gives A.l 



The integrals in A.l can be evaluated by using the result 
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where K denotes a modified Bessel function, 
Watson (1966) . 



Examples of members of this class are k=l in which case 

E{v,'|PJ=V27b(l/|ft|) 
which corresponds to the prior used in the Lasso technique, 
Tibshirani (1996) . See also Figueiredo(2001) . 

The case k=0.5 gives 

^{<m= >/2^(l/|ft|){K,(2V^)/K,(27^)} 
oar equivalently 

E{v,^|pJ=(l/|ftP){27^K,(2V^)/Ko(2V^)} . 
where Ko and K, are modified Bessel functions, see Abramowitz 
and Stegun(1970). Polynomial approximations for evaluating 
these Bessel functions can be found in Abramowitz and 
Stegun(1970, p379) . Details of the above calculations are 
given in the Appendix. 

The expressions above demonstrate the connection with the 
Lasso model and the Jeffreys prior model. 

It will be appreciated by those skilled in the art that as k 
tends to zero and b tends to infinity the prior tends to a 
Jeffreys inproper prior. . 

In one embodiment, the priors with 0< k ^1 and b>0 form a 
class of priors which might be interpreted as penalising non 
zero coefficients in a manner which is between the Lasso 
prior and the original specification using Jeffreys prior. 

in another embodiment, in the case of generalised linear 
models, step (d) in the maximisation step may be estimated 
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by replacing |il with its expectation E{^} . This is 

preferred when the model of the data is a generalised linear 
model . 



5 For generalised linear models the expected value Ef-^ 
be calculated as follows. Beginning with 

5p ^ \^an/^^^^ (CIO) 

0 where X is the N by p matrix with i'^'^ row Xi' and 
we get 



Equations (CIO) and (Cll) can be written as 



where V=A(aj (<1>K • 



Preferably, for generalised linear models, the EM algorithm 
comprises the steps: 

(a) Choose a hyper prior and its parameters. 

initialising the algorithm by setting n=0. So = 
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{1,2,.^, p } , q,<o) , applying a value for e, such as 
for example e = 10"=, and 
If p ^ N compute initial values p* by 

P=(X'X4-AD-'X-g(y+0 ^^^^j 

and if p > N compute initial values p* by 

P*=XCI -X^(XX^+XJ)-'X)X^g(y+0 (C15) 

where the ridge parameter X satisfies 0 < A, ^ i and C 
is small and chosen so that the link function is well 
defined at y+^ . 

(b) Defining 

p(>>)= I Pi > is S„ 
0, otherwise 

and let Pn be a matrix of zeroes and ones such 
that the nonzero elements y(n) of p (n) satisfy 

Y = P7P , P=p[y 

(c) performing an estimation (E) step by calculating 
the conditional e^^ected value of the posterior 
distribution of component weights using the 
function: 

QO I P^"\ q><">) = E{ logpO, (p , V I y) I y. p(°>, <p(")} 

= L(y I P, q)W) -0.5 JS"" A(d(J3^"^)y^ p ^ ^ 

where L is the log likelihood function of y. * 
Using p =P„Y and p<'» = P„y-> (cis) can be written 



as 



(d) 



Q(y I Y^"^ <P^"^) = US\ P„y, <p(-)>0.5 yr £,(d(y<''^yr^y (C17) 
performing a maximisation (M) step by applying an 
iterative procedure, for example a Newton Raphson 
iteration, to maximise Q as a function of y whereby 
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yo 



= y<"Und for r=0,l,2_ y,,, . ^ ^ 5^ ^j^^^^ 
is chosen by a line search algorithm to ensure 
Q(r.,|y<^9^'^) , Q(yJ/»>.<p<->)^ ^^^^^^ 
P :S N use 



where 



A(d(rC->))[Y;v;'Y„-.I]->(YX'.^-^) (CIS) 



X.=A(d(yC*))P7X 



V=A(a,(«p)xf(|0L)2) 

and the subscript r denotes that these quantities are 
evaluated at n = h(XP„y,). 
For p > N use 

With and 2r defined as before. 

Let y* be the value of y, when some convergence 
criterion is satisfied e.g 

I 1 Tr - yr+i| J < s (for example lo'^ ) . 

Define p* = p„y* , s„,=(i: | P, j >inax(|p, |*e, ) } where 
e^ is a small constant, say ie-5. Set n=n+l and 
choose <p- = <pn , . where satisfies 

g-L(y|P„r*,<P) = 0 and Kn is a damping factor such that 

0< K„ ^ 1. Note that in some cases the scale parameter 
IS known or this equation can be solve explicitly to 
get an updating equation for <p. 
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10 



15 



20 



25 



The above embodiments may be extended to incorporate quasi 
likelihood methods Wedderbum (1974) and McCullagh and 
Nelder (1983)). In such an embodiment, the same iterative 
procedure as detailed above will be appropriate, but with L 
replaced by a quasi likelihood as shown above and, for 
example. Table 8.1 in McCullagh and Nelder (1983). m one 
embodiment- there is a modified updating method for' the scale 
parameter <p. To define these models requires specification' 
of the variance function , the link function g and the 
derivative of the link function ^. Once these are defined 
the above algorithm can be applied. 

In one embodiment for quasi likelihood models, step 5 of the 
above algorithm is modified so that the scale parameter is 
updated by calculating . 



Where ^ and x are evaluated at p* = p^y* . Preferably, this 
updating is performed when the number of parameters s in the 
model is less than N. A divisor of N -s can be used when s 
is much less than N. 

in another embodiment, for both generalised linear models 
and Quasi likelihood models the covariate matrix X with rows 
Xi can be replaced by a matrix K with ij'^ element k,^ and 
kij = K(xi-xj) for some kernel function k . This matrix can 
also be augmented with a vector of ones. Some example 
kernels are given in Table 2 below, see Evgeniou et 
al(1999). 



30 



Kernel function 

Gaussian radial basis function 
Inverse multiquadric 



Multi layer perceptron 
Ploynomlal of degree d 



B splines 
Trigonometric polynomials 



Foitnula for k( x - y ) 

exp(-||x-y(|^/a).a>0 

(l|x-y||^ + c^)-i« 

(l|x-y|r+c^)>^ 



tanh( X y-e ) , for suitable 9 
(T+xyf 



B2n+i(x - y) 
sin(( d +1/2 Xx-y))/sin((x-y)/2) 



Table 2: Examples of kernel functions 

in Table 2 the last two kernels are one dimensional i.e. for 
the case when X has only one coluno.. Multivariate versions 
can be derived from products of these kernel functions. The 
defxnxtxon of B^n.x can be found in De Boor (1978 ) Use of a 
kernel function in either a generalised linear model or a 
quasi likelihood model results in mean values which are 
smooth (as opposed to transforms of linear) functions of the 
covariates X. Such models may give, a substantially better 
fit to the data. 

A fourth embodiment relating to a proportional hazards model 
Will now be described. 



D. Proportional Hazard Model 
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The method of this embodiment may utilise training samples 
in order to identify a subset of components which are 
capable of affecting the probability that a defined ev«nt 
(eg death, recovery) „iu occur within a certain time 
5 period. Training samples are obtained from , system and the 
time measured from when the training sample is obtained to 
when the event has occurred, using a statistical method to 
associate the time to the event with the data obtained from 

10 t "'"'""'^ °* - -*set of con^onents may 

10 be identified that are capable of predicting the 

distribution of the time to the event. Subsequently, 
taowledge of the subset of components can be used for tests 
for example clinical tests to predict for example, 
statistical features of the time to death or time to relapse 
IS of a disease. For example, the data from a subset of 
components of a system may be obtained from a DNa 
microarray. This data may be used to predict a clinically 
relevant event such as, for example, ejected or median 
patient survival times, or to predict onset of certain 
20 symptoms, or relapse of a disease. 

in this way, the present Invention identifies preferably a 
relatively small number of components which can be used to 
predict the distribution of the time to an event of a 

'.B system. ihe selected components are -predictive- for that 
time to an event. Essentially, from all the data which Is 
generated from the system, the method of the present 
invention enables Identification of a small number of 
components which can be used to predict time to an event 

0 C«ce those con»onents have been identified by this method, 
the components can be used in future to predict statistical 
. features of the tl^ to an event of a system from new 
samples. The method of the present invention preferably 



5 
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utilises a statistical method to eliminate cotr4)onents that 
are not required to correctly predict the time to an event 
of a system. By appropriate selection of the hyperparameters 
in the model some control over the size of the selected 
subset can be achieved. 

AS used herein, "time to an event- refers to a measure of 
the time from obtaining the sample to which the method of 
the invention is applied to the time of an event. An event 
may be any observable event. When the system is a 
biological system, the event may be, for example, time till 
failure of a system, time till death, onset of a particular 
symptom or symptoms, onset or relapse of a condition or 
disease, change in phenotype or genotype, change in 
biochemistry, change in morphology of an organism or tissue, 
change in behaviour. 

The samples are associated with a particular time to an 
event from previous times to an event. The times to an 
event may be times determined from data obtained from, for 
example, patients in which the time from sampling to death 
is known, or in other words, "genuine" survival times, and 
patients in which the only information is that the patients 
were alive when samples were last obtained, or in other 
words, "censored" survival times indicating that the 
particular patient has survived for at least a given number 
of days. 

in one embodiment, the input data is organised into an 
«x;,data matrix X = {x,) with „ test training samples and p 
components. Typically, p will be much greater than n. 
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For example, consider an Nxp data matrix jr = (ji:^) from, for 

example, a microarray experiment, with N individuals (or 
samples) and the same p genes for each individual. 
Preferably, there is associated with each individual 
I (i = l,2,'--,iS^) a variable j;^ { 3;,- >0) denoting the time to an 
event, for example, survival time. For each individual 
there may also be defined a variable that indicates whether 
that individual's survival time is a genuine survival time 
or a censored survival time. Denote the censor indicators 
as where 

{/, if yi is Iincensored 
(7, if J/,- is censored 

The NkI vector with survival times y^ may be written as y 
and the JVxl vector with censor indicators c^ as g. 

Typically, as discussed above, the component weights are 
estimated in a manner which takes into account the a priori 
assumption that most of the component weights are zero. 

Preferably, the prior specified for the component weights is 
of the form 

^(A. A.-.y^„)= J]l^(A|^i )p{r, )dT (Dl) 

where .^b are component weights, ^(AK) is//(o,T/) and 

P(r, ) is some hyperprior distribution 

that is not a Jeffrey's hyperprior. 
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In one embodiment, the prior distribution is an inverse 
gamma prior for t in which t? =l/zf has an independent gaxnma. 
distribution with scale parameter b>0 and shape parameter 
k>0 so that the density of t? is 

5 y(t?M)=b-'(tf/b)^-^exp(.t?^^^ . 

It can be shown that: 

E{t^|p}= (2k+l)/(2/b + P^) (A) 
Equation A can be shown as follows: 

10 

. Define 

so 

0 

then 

I(p,b.k>=b'^-*{r(p+lc+0.5)/r(k)}(l-K).5bp^)-<^-'> 

15 Proof 

Let 3 = 0^12 then 

I(p.b^)=b'^•*](t^^)^' e3q)(-st^Mt',b^)dt^ 

0 

Now using the substitution u = t^ /b we get 

I(p,b,k)=b'^^ /(u)*^^ exp(-sub)y(u,l,k)du 

0 

20 Now let s'=bs and substitute the expression for y(u,14c) . This 
gives 

CO 

I(p,b,k)=b'^' J expHs'+l^u-^-^'du / r(k) 

0 

Looking up a table of Laplace transforms, eg Abramowitz and 
Stegun, then gives the result. 

25 

The conditional expectation follows from 
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E{tMp}=I(l,b^)/I(0,bJc) 
= (2k+l)/(2^ + p2) 



As k tends to zero and b tends to infinity, a result 
equivalent to using Jeffreys prior is obtained. For example, 
for ksO.005 and b=:2*10^ 

E{t^ |p} = (1.01)/(10-'+p') 

Hence we can get arbitrarily close to a Jeffery's prior with 
this proper prior. 

The modified algorithm for this model has 
where the esqpectation is calculated as above. 

In yet another embodiment, the prior distribution is a gamma 
distribution for r? . Preferably, the gamma distribution- has 
scale parameter b>0 and shape parameter k>0. 

It can be shown, that 

GO 

Ju''-'''-'exp(-(Y,/u +u))du 
b Ju*^-"^'exp(-(Yj/u+u))du 

0 

Vb|PilK,«.,(2V^) 

I Pi I' ^V2.U2^) 

where Yi=p.^/2b and K denotes a modified Bessel fxmction. 
Some special members of this class are k=l in which case 

E{rr^lP,}=>^(l/IP,|) 
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which corresponds to the prior used in the Lasso technic[ue, 
Tibshirani (1996) . See also Figueiredo(2001) . 



The case k=0.5 gives 



or equivalently 



iPi}= aw ){2>/y;'k,(2^/;:)/k„(2V^)} 



where Ko and Ki are modified Bessel fimctions, see Abratnowitz 
and Stegun(1970) . Polynomial approximations for evaluating 
these Bessel functions can be found in Abratnowitz and 
Stegun{1970, p379) . 

The eaqjressions above demonstrate the connection with the 
Lasso model and the Jeffreys prior model. 

Details of the above calculations are as follows: 
For the gamma prior above and y,=p.^/2b 



ju''-'''-'exp(-(Y./u+u))du 

^{<"m=-^ 

bJu'=-"^-'exp(-(Y./u+u))du 



o 




(D2) 



where K denotes a modified Bessel function. 
For k=l in (D2) 



E{ri-'|Pi}=V2?b(l/|Pj|) 



For K=0.5 in (D2) 
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or equivalent ly 

Mr;' |Pi>- ){2^K,(2^)/Ko(2^)} 

Proof 

Prom the definition of the conditional expectation, writing 
Yi=PiV2b, we get 



JrJVexp(-y,TJ-=')b-'(TJ-'/b)•^'exp(7i-'^)dT,' 
E{rr^|ft}=^4 



/r,-'exp(-w'')b-'(r;%)'-'exp(T5^/b)dr,^ 

0 



Rearranging, simplifying and making the sxibstitution u==v?/b 
gives A.l 

The integrals in A.l can be evaluated by using the result 



0 

where K denotes a modified Bessel function, see 
Watson (1966) . 

In a particularly preferred embodiment, p{r,)a\/Tf is a 
Jeffreys prior, Kotz and Johnson (1983) . 

The likelihood function defines a model which fits the data 
based on the distribution of the data. Preferably, the 
likelihood function is of the form: 

AT 

Log (Partial) Likelihood = J]^,- (^.^;X,y,c) 

1=1 

where f =(Pi.^,-,Pp) and =(^^,(i?2,-.«'^)are the model 
parameters. The model defined by the Likelihood function 
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may be any model for predicting the time to an event of a 
system. 

In one embodiment, the model defined by the likelihood is 
Cox's proportional hazards model. Cox's proportional hazards 
model was introduced by Cox (1972) and may preferably be 
used as a regression model for survival data. In Cox's- 

proportional hazards model, ^^is a vector of (explanatory) 
parameters associated with the componentis . Preferably, the 
method of the present invention provides for the 
parsimonious selection (and estimation) from the parameters 

fi^ -(^'fil'"''fip) foa^ Cox's proportional hazards model given 
the data X , y and c . 

Application of Cox's proportional hazards model can be 
problematic in the circumstance where different data is 
obtained from a system for the same survival times, or in 
other words, tied survival times. Tied survival times may 
be subjected to a pre-processing step that leads to unique 
survival times. The pre-processing proposed simplifies the 
ensuing code as it avoids concerns about tied survival times 
in the sxabsequent application of Cox's proportional hazards 
model . 

The pre-processing of the survival times applies by adding 
an extremely small amount of insignificant random noise. 
Preferably, the procedure is to take sets of tied times and 
add to each tied time within a set of tied times a random 
amount that is drawn from a normal distribution that has 
zero mean and variance proportional to the smallest non-zero 
distance between sorted suarvival times. Such pre-processing 
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achieves an elimination of tied times without imposing a 
draconian perturbation of the survival times. 

The pre-processing generates distinct survival times. 
Preferably, these times may be ordered in increasing 



Denote by Zthe Nxp matrix that is the re -arrangement 
of the rows of X where the ordering of the rows of 
Z corresponds to the ordering induced by the ordering of / ; 
also denote by Zj the j'^^ row of the matrix Z . Let d be the 

result of ordering c with the same permutation required to 
order jf . 

After pre-processing for tied survival times is taken 
into account and reference, is made to standard texts on 
survival data analysis (eg Cox and Oakes, 1984), the 
likelihood function for the proportional hazards model may 
preferably be written as 



magnitude denoted as i^{\iy\2y t{N))' '(i+i) > '(,) • 




(D3) 



where fi^ = {^lh.p2. - .Pn) . = the 3^'^ row of Z, and 

9?y= {i.-i = j,y+l,.-.,Ar}=. the risk set at the f-^ ordered event 

time /(,). 



The logarithm of the likelihood (ie L = log{l)) may 
preferably be written as 
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N 
l-l 



f f 



N 



f 



JJ 



ZiP-log 



N 



Y,<i.jexp(ZjP) 

JJ 



where 



(D4) 



Notice that the model is non-paratnetric in that the 
parametric form of the survival distribution is not 
specified - preferably only the ordinal property of the 
survival times are used (in the determination, of the risk 
sets) , As this is a non-parametric case (p is not required 
(ie gf=0) . 



In another embodiment of the method of the invention, the 
model defined by the likelihood function is a Parametric 
survival model. Preferably, in a parametric survival model, 

P ±& a. vector of (explanatory) parameters associated with 

7* 

the components, and ^ is a vector of parameters associated 
with the functional form of the survival density function. 

Preferably, the method of the invention provides 

for the parsimonious selection (and estimation) from the 

parameters ^^and the estimation of (p^ =[g>i,q>2r->9q) for 

parametric survival models given the data X , y and c . 

In applying a parametric survival model, the survival times 
do not require pre-processing and are denoted as y . 

The parametic survival model is applied as follows: 



- 78 - 

Denote by f^^y;0,jS,x) the parametric density fianction of the 
survival time/ denote its survival function by 

00 

^(^'?'^'^)^ J/{tt;^,j^,A')rf«where ^ are the parameters relevant 
y 

to the parametric form of the density function and yff.ATare 
as defined above. The hazard function is defined as 
h[yi:<p.p.x) =^f{yt:<p,/3,x)/s[yi:<p.p.x) , 

Preferably, the generic formulation of the log-likelihood 
fvmction, taking censored data into account, is 
N . 

Reference to standard texts on analysis of survival time 
data via parametric regression survival models reveals a 
collection of survival time distributions that may be used. 
iSurvival distributions that may be used include, for 
example, the Weibull, Ea^onential or Extreme Value 
distributions . 

If the hazard function may be written as 

h{yi;<p.P.x) = x(yi;^)exp[Xifi)then S(yi:<p,p.x) = exp[-A[yi;tp)e^'^^ and 
f(yi''P'fi'^) = ^(yi;^)e^(j^ij3-A{yi)e^'^^ where ^(j'/;?)= is 

the integrated hazard ftinction and , Xi is 

^ dyi 

the i^^ row of X. 



The Weibull, Exponential and Extreme Value . distributions 
have density and hazard functions that may be written in the 
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form of those presented in the paragraph immediately 
above. 



The application detailed relies in part on an algorithm of 
Aitken and Clayton (1980) however it permits the user to 
specify any parametric xinderlying hazard fimction. 

Following from Aitkin and Clayton (1980) a preferred 
likelihood function which models a parametic survival model 
is: 



N 
i=l 



Cilog{Mi)-Mi+Ci 



J)\ 



{D5) 



where /i,= ^(j^/;^)exp(X/^) . Aitkin and Clayton (1980) note 
that a consequence of equation (11) is that the C/'s may be 
treatedas Poisson variates with means ///and that the last 
term in equation (11) does not depend on p (although it 
depends on ^?) . 



Preferably, the posterior distribution ot 0 , q> and r given 
y is 



wherein is the likelihood function. 



(D6) 



In one embodiment, r may be treated as a vector of missing 
data and an iterative procedure used to maximise equation 
(D6) to produce a posteriori estimates of fi . The prior of 
equation (Dl) is such that the maximum a posteriori 
estimates will tend to be sparse i.e. if a large number of 
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parameters are redtmdant, many components of will be 
zero. 



Because a prior expectation exists that many components of 
fi are zero, the estimation may be performed in such a way 
that most of the estimated pf' a are zero and the remaining 
non-zero estimates provide an adequate explanation of the 
survival times. 

In the context of microarray data this exercise translates 
to identifying a parsimonious set of genes that provide an 
adequate explanation for the event times. 

As stated above, the conponent weights which maximise the 
posterior distribution may be determined using an iterative 
procedure. Preferable, the iterative procedure for 
maximising the posterior distribution of the components and 
component weights is an EM algorithm, such as, for exartple, 
that described in Dempster et al, 1977. 

If the E step of the EM algorithm is examined, from (D6) 
ignoring terms not involving beta, it is necessary to 
compute 



i-l 

(D7) 



where d,=d,iP,) = E{\/vJ \ fi^)-^"" and for convenience we define 

rfi = l/rf, = Oif^, = 0. In the following we write d=d(^)=(d„d^,..jj . 

In a similar manner we define for example di/S^"^) and 
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d(r'"')=Pjd(Py') where fi""^Py' and P„ is obtained from the p 
by p identity matrix by omitting columns j for which ;^"^=0. 

.Hence to do the E step requires the calculation of the 
conditional expected value of ti=llTi when p[fi,\rf) is n(o,t^) 

and p^T^)ha.8 a specified prior distribution as discussed 
above. 

In one embodiment, the EM algorithm comprises the steps: 

1. Choose the hyperprior and values for its parameters, 
namely b and k. Initialising the algorithm by setting n=0, 
50= {1,2,.^, p } , initialise ^^'^ = fi\ ^C^) , 

2 . Defining 

0, Otherwise 

and let be a matrix of zeroes and ones such that 
the nonzero elements y^"^ of satisfy 



3. Performing an estimation step by calculating the expected 
value of the posterior distribution of component weights. 
This may be performed using the function: 

fiC^I^W,^^) = £{log(p(A.^.ri^))|3;,^W,^('')} 

^ Pf n V (D9) 



•) 
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where L is the log likelihood function of y . Using 
fi = P„y and = i>„y<"> we have 

4. Performing the maximisation step. This may be 
performed using Newton Raphson iterations as follows: 

Set ro=/''^and for r=0,l,2,.. y^^^ = + where 

a^is chosen by a line search algorithm to ensure 

G(r.., I /"Kv^"'^) > Q(n I /"U^"^) , and 

5.=A(rf0'W))[-A(rf(/''>))^A(J0'(->))+/]-«(-^ tr 

where ;^ = P/^ . _ = i>^_p„ for ^, = P„3., (Dll) 

Let ^* be the value of yy. when some convergence 
criterion is satisfied e.g < e (for example ff = iO-= ) 

5. Define P* =PnY\ 5'„=|i':|AI >5ii«flx|>0f,. || where is a small 
constant, say 10- = . Set n=n+l, choose ^(""^^J + -^C")! 

where ^ satxsfxes =0 ^nd k„ is a damping 
factor such that 0<ic„<l. 

6. Check convergence. If ||<^2 where is suitably 
small then stop, else go to step 2 above. 
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In another eitibodiment / step (Dll) in the maximisation step 



5 In one embodiment, the EM algorithm is applied to maximise 
the posterior distribution when the model is Cox's 
proportional hazard' s model • 

To aid in the exposition of the application of the EM 
10 algorithm when the model is Cox's proportional hazards 
models it is preferred to define ^^dynamic weights" and 
matrices based on these weights. The weights are - 



may be estimated by replacing — r — with its expectation 





15 



20 
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Matrices based on these weights are 



Wi 



W2 



lo 



01 



1=1 

In terms of the matrices of weights the first and 
second derivatives of L may be written as - 



(D12) 



where K=W -A\W j. Note therefore from the transformation 

matrix P„ described as part of Step (2) of the EM algorithm 
(Equation D8) (see also Equations (DID) it follows that 



(D13) 
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Preferably, when the model is Cox's proportional hazards 
model the E step and M step of the EM algorithm are as 
follows: 

1. Choose the hyperprior and its parameters b and k. Set 
n=0. So = {1,2,'-', p) . Let v be the vector with 
components 

Xe , if c,=0 

for some small e , say .001. Define f to be log(v/t). 
T.t p ^ N compute initial values fi* by 

^*=(Z'"Z+Af)''Z^f 
It p > N compute initial values ^*by 

where the ridge parameter X satisfies 0 < ;i ^ l, 
2 • Define 

0, othCTwise 

Let i^be a matrix of zeroes and ones such that the 
nonzero elements y^of fi^^^ satisfy 



^ pT^in) ^ pin) ^ pyn) 

3. Perform the E step by calculating 
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Qmt^) = ^{lOg(p(^,^,T|£))|£,^W} 

where L±s the log likelihood fiinction of jf given by 
Equation (8). Using fi = P„y and = Py"^ we have 

Q(y I /'b = L{i I /'„j:)-^r''A(</0'<»>)V 

4. Do the M step. This can be done with Newton Raphson 
iterations as follow^. Set = r^'"^ and for r=0,l,2,... 

Yr+\ = Yr where is chosen by a line search 

algorithm to ensure | /''\q>^''^)>Q^r^ | . 

For p ^ N use 

where r = Z;> ^(rf^^W >j. 
For .p > N use 

Let y* be the value of when some convergence 

criterion is satisfied e.g j | y, - y^^) | < s (for example 
10-^) . 



5. Define p* =P„y* , S„ = 



'•I A- I >eimax\/3j\\ where Si is a small 



constant, say 10"=. This step eliminates variables with 
very small coefficients. 

6. Check convergence. If \\r*-/"h\<e2 where Sz is suitably 
small then stop, else set n=n+l, go to step 2 above and 
repeat procedure until convergence occurs. 
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In another etnbodiment the EM algorithm is applied to 
maximise the posterior distribution when the model is a 
parametric survival model. 



5 In applying the EM algorithm to the parametic survival 

model, a consequence of equation (11) is that the ci's may- 
be treated as Poisson variates with means and that the 
last term in equation (11) does not depend on P (although 
it depends on qp) . 

10 Note that /o^(/i|) = /o^(^(>^/;^))+ JT/^ and so it is possible 

to couch the problem in terms of a log-linear model for the 
Poisson-like mean. Preferably, an iterative maximization of 
the log- likelihood function is performed where given initial 
estimates of ^ the estimates of p are obtained. Then given 
15 these estimates of p , updated estimates of are obtained. 
The procedure is continued until convergence occurs. 

Applying the posterior distribution described above, we 
note that (for fixed 4p) 

dp Mdfi^dfi ^ dp dp 
20 Consequently from Equations (11) and (12) it follows that 

The versions of Equation (12) relevant to the parametric 
survival models are 



drr ' dp^ 



(D15) 
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To solve for ^ after each M step of the EM algorithm 
(see step 5 below) preferably put 

?^"*^^ = «>^"'*'^^+«:„L*-^(")] where <p* satisfies — = 0 for 0<ic„<l 

and P is fixed at the value obtained from the previous M 
step. 

It is possible to provide an EM algorithm for parameter 
selection in the context of parametric survival models and 
microarray data. Preferably, the EM algorithm is as 
follows : 

1. Choose a hyper prior and its parameters b and k eg b=le7 
and k=0.5. Set n=0, So = {1,2,-", p} ^(*"'^'^) = ^(0) . Let v be 
the vector with components 

for some small e , say for example .001. Define f to be 
log (v/A (y, 9) ) . 

It p ^ N compute initial values fi' by fi* ={X^X+Xry^X^f . 
It p > N compute initial values /3* by 

where the ridge parameter A satisfies 0 < A ^ 1. 



2. Define 



0, otherwise 

Let ij,be a matrix of zeroes and ones such that the nonzero 
elements y^"^ot J3^"^ satisfy 
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3. Perform the E step by calculating 

where Lis the log likelihood function of y and . 
Using fi = P„Y and = i^/»> we have 

4. Do the M step. This can be done with Newton Raphson 
iterations as follows. Set y^^^y^^'^axiA for r=0^1,2,... 

= + where is chosen by a line search 

algorithm to ensure jg^/^^j | /"W^"^) >fi(rr I /"^^V^"^)* 

For p ^ N use 
^, = -A(J(3.<-^))[lf A(//)r„ +/r'(r/(£-/.)-A(^rf(y(-)))^) 
where r=AP„A(j(/''>)). 
For p > N use 

Let y* be the value of yr when some convergence 

criterion is satisfied e.g | | yr - yr+i| | < s (for example 
10-^ ) . 

5. Define 5'„=|i:|A | >5iwax|>5y || where Si is a small 
constant, say lO"*. Set n=n+l, choose ^^'''^^^ =^(")+js:„ -^(")] 
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. ^. dLly\P„r\ip] 
where q> satisfies — "^^ =0 and k„ is a damping 

factor such that 0<x:„<l. 

6. Check convergence. If ||<£^2 where £2 is suitably 

small then stop, else go to step 2. 

In another embodiment, survival times are described by 
a Weibull survival density fxanction. For the Weibull case 
<p is preferably one dimensional and 

<p=a 

Preferably, =^+2(c,-/i,.) 70^(3;,.) = 0 is solved 

after each M step so as to provide an updated value of a . 

Following the steps applied for Cox's proportional hazards 
model, one may estimate a and select a parsimonious sxibset 
of parameters from ^that can provide an adequate 
explanation for the survival times if the survival times 
follow a Weibull distribution. A numerical example is now 
given . 

The preferred embodiment of the present invention will now 
be described by way of reference only to the following non- 
limiting example. it should be understood, however, that 
the example is illustrative only, should not be taken in any 
way as a restriction on the generality of the invention 
described above. 



Example : 
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Full normal regression example 201 data points 41 basis 
functions 

k=0 and b=le7 

the correct four basis functions are identified namely 
2 12 24 34 

estimated variance is 0.67. 
With k=0.2 and b=le7 

eight basis functions are identified, namely 
2 8 12 16 19 24 34 

estimated variance is 0.63. Note that the correct set of 
basis functions is included in this set. 



The results of the iterations for k=:0.2 and b=le7 are given 
below. 



EM Iteration: 0 expected post: 2 basis fns 41 
Sigma squared 0.6004567 

EM Iteration: 1 expected post: -63.91024 basis fns 41 
Sigma squared 0.6037467 

EM Iteration: 2 expected post: -52.76575 basis fns 41 
Sigma squared 0.6081233 

EM Iteration: 3 expected post: -53.10084 basis fns 30 
Sigma squared 0.6118665 

EM Iteration: 4 expected post: -53.55141 basis fns 22 
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Sigma squared 0.6143482 

EM Iteration: 5 expected post: 



-53.79887 basis fns 18 



sigma squared 0.6155 
5 EM Iteration: 6 expected post: -53.91096 basis fns 18 

sigma squared 0.6159484 

EM Iteration: 7 expected post: -53.94735 basis fns 16 

10 Sigma squared 0.6160951 

EM Iteration: 8 esqpected post: -53.92469 basis fns 14 

sigma squared 0.615873 

EM Iteration: 9 expected post: -53.83668 basis fns 13 

15 

sigma squared 0.6156233 

EM Iteration: 10 expected post: -53.71836 basis fns 13 

signna squared 0.6156616 
0 EM Iteration: 11 expected post: -53.61035 basis fns 12 

sigma sc[uared 0,6157966 

EM Iteration: 12 expected post: -53.52386 basis fns 12 

5 sigma squared 0.6159524 

EM Iteration: 13 expected post: -53.47354 basis fns 12 

sigma squared 0.6163736 

EM Iteration: 14 expected post: -53.47986 basis fns 12 
) • • 

sigma squared 0.6171314 

EM Iteration: 15 expected post: -53.53784 basis fns 11 
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Sigma squared 0.6182353 

EM Iteration: 16 eaqpected post: 



-53.63423 basis fns 11 



Sigma squared 0.6196385 
5 EM Iteration: .17 expected post: -53.75112 basis fns 11 

Sigma squared 0.621111 

EM Iteration: 18 expected post: -53.86309 basis fns 11 

10 .Sigma squared 0.6224584 

EM Iteration: 19 expected post: -53.96314 basis fns 11 

Sigma squared 0.6236203 

EM Iteration: 20 expected post: -54.05662 basis fns 11 

15 

Sigma squared 0.6245656 

EM Iteration: 21 e3q>ected post: -54.1382 basis fns 10 

Sigma squared 0.6254182 
0 EM Iteration: 22 expected post: -54.21169 basis fns 10 

Sigma squared 0.6259266 

EM Iteration: 23 expected post: -54.25395 basis fns 9 

5 Sigma stjuared 0.6259266 

EM Iteration: 24 expected post: -54.26136 basis fns 9 

Sigma squared 0.6260238 

EM Iteration: 25 expected post: -54.25962 basis fns 9 

) 

Sigma squared 0.6260203 

EM Iteration: 26 expected post: -54.25875 basis fns 8 
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Sigma squared 0.6260179 

EM Iteration; 27 expected post: -54.25836 basis fns 8 

Sigma squared 0.626017 
5 EM Iteration: 28 expected post: -54.2582 basis fns 8 

Sigma squared 0.6260166 

Fore the reduced data set with 201 observations and 10 
10 variables, k=0 and b=ie7 

Gives the correct basis functions, namely 12 3 4. With 
k=0.25 and b=le7, 7 basis functions were chosen, namely 1 2 
3 4 6 8 9. A record of the iterations is given below. 
Note that this set also includes the correct set 

15 

EM Iteration: o expected post: 2 basis fns 10 
Sigma squared 0.6511711 

EM Iteration: i expected post: -66.18153 basis fns 10 

20 

Sigma squared 0.6516289 

EM Iteration: 2 expected post: -57.69118 basis fns 10 

Sigma squared 0.6518373 
25 EM Iteration: 3 expected post : -57.72295 basis fns 9 

Sigma squared 0.6518373 

EM Iteration: 4 expected post: -57.74616 basis fns 8 

30 Sigma squared 0.65188 

EM Iteration: 5 expected post: -57.75293 basis fns 7 

Sigma squared 0.6518781 
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Ordered categories examples 

IjUo prostate data 15 samples 9605 genes. For k«0 and b=le7 
we get the following results 

misclassif ication table 

pred 
y 1 2 3 4 

1 4 0 0 0 

2 0 2 1 0 

3 0 0 4 0 

4 0 0 0 4 

Identifiers of variables left in ordered categories model 
6611 

For k=0.25 and b«le7 we get the following. results 

misclassif ication table 

pred 
y 12 3 4 

1 4 0 0 0 

2 0 3 0 0 

3 0 0 4 0 

4 0 0 0 4 

Identifiers of variables left in ordered categories model 
6611 7188 

Note that we now have perfect discrimination on the training 
data with the addition of one extra variable. A record of 
the iterations of the algorithm is given below. 



- 96 - 

Iteration 1 : 11 cycles, criterion -4.661957 

misclassification matrix 
fhat 
5 f 1 2 

1 23 0 

2 0 22 

row =t2rue class 

10 Class 1 Number of basis functions in model : 9608 
********* If***************** ************** ****** 
Iteration 2 : 5 cycles, criterion -9.536942 

misclassification matrix 
15 fhat 
f 12 

1 23 0 

2 1 21 

row =t;rxie class 

20 

Class 1 Number of basis functions in model : 6431 
***********************************^t^f^t^,^,^^,^,^^^,^ 

Iteration 3 : 4 cycles, criterion -9.007843 

25 misclassification matrix 
fhat 
f 12 

1 23 0 

2 0 22 

30 row =true class 

Class 1 Number of basis functions in model : 508 
**********************************^,*4f*^,^,^i^^^,^^^^^ 
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Iteration 4 : 5 cycles, criterion -6.47555 

misclassif ication matrix 
fhat 
5 f 12 

1 23 0 

2 0 22 

row =true class 

10 ■ Class 1 Number of basis functions in model : 62 
************************************************ 
Iteration 5 : 6 cycles, criterion -4.126996 

misclassif ication matrix 
15 fhat 

f 12 

1 23 0 

2 1 21 • 
row =true class 

20 

Class 1 Number of basis functions in model : 20 

***************** ***-k*********ie*-k*ie************ 

iteration 6 : 6 cycles, criterion -3.047699 

25 misclassif ication matrix 
fhat 
f 1 2 

1 23 0 

2 1 21 

30 row =true class 

Class 1 Number of basis functions in model : 12 
*********************************************** 
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Iteration 7 : 5 cycles, criterion -2.610974 

misclassification matrix 
fhat 

5 f 12' 

1 23 0 

2 1 21 

row =true class 

10 Class 1 : Variables left in model 
12 3 408 846 6614 7191 8077 
regression coefficients 

28.81413 14.27784 7-025863 -1 . 086501e-06 4.553004e-09 - . 
16.25844 0.1412991 -0.04101412 

15 

it****************************** **************** 

Iteration 8 : 5 cycles, criterion -2.307441 

misclassification matrix 
20 . fhat 
f 1 2 

1 23 0 

2 1 21 

row =true class 

25 

Class 1 : Variables left in model 
12 3 6614 7191 8077 
regression coefficients 

32.66699 15.80614 7.86011 -18.53527 0.1808061 -0.006728619 

30 

*********************************************** 
Iteration 9 : 5 cycles, criterion -2.028043 
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misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =»true class 

Class 1 : Variables left in model 
12 3 6614 7191 8077 
regression coefficients 

36.11990 17.21351 8.599812 -20.52450 0.2232955 -0.0001630341 

Iteration 10 : 6 cycles^ criterion -1.808861 

misclassif ication matrix 

fhat 
f 1 2 

1 23 0 

2 0 22 

row =:true class 

Class 1 : Variables left in model 
12 3 6614 7191 8077 
regression coefficients 

39.29053 18.55341 9.292612 -22.33653 0.260273 -8 . 696388e-08 

********************** ********* 
Iteration 11 : 6 cycles, criterion -1.656129 



misclassif ication matrix 
fhat 
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1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
1. 2 3 6614 7191 
regression coefficients 

42,01569 19.73626 9.90312 -23,89147 0.2882204 

Iteration 12 : 6 cycles, criterion -1.554494 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

44.19405 20.69926 10.40117 -25.1328 0.3077712 
********* *****************^^^^^^^^^^^^^^^^^^^^^ 

Iteration 13 : 6 cycles, criterion -1.487778 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 
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Class 1 Variables left in model 
12 3 6614 7191 
regression coefficients 

45.84032 21.43537 10.78268 -26.07003 0.3209974 

Iteration 14 : 6 cycles, criterion -1.443949 

tnisclassif ication matrix 

10 fhat 

f 12 

1 23 0 

2 0 22 

row =true class 

15 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

47.03702 21.97416 11.06231 -26.75088 0.3298526 

20 

*********************************************** 

Iteration 15 : 6 cycles, criterion -1.415 

tnisclassification matrix 

25 fhat 

f 12 

1 23 0 

2 0 22 

row =true class 

30 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 
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47.88472 22.35743 11.26136 -27.23297 0.3357765 



************************** ftitir1c**1c*************1t 

Iteration 16 : 6 cycles, criterion -1.395770 

5 

misclassification matrix 

fhat 
f 12 
1 23 0 
10 2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
15 regression coefficients 

48.47516 22.62508 11.40040 -27.56866 0.3397475 

************************** f,^l1ef,^c^,^,^,^t*******^,1|,^e^f^, 
Iteration 17 : 5 cycles, criterion -1.382936 

20 . 

misclassification matrix 

fhat 
f 1 2 
1 23 0 
25 2 0 22 

row =tirue class 

Class 1 : Variables left in model 
12 3 6614 7191 
30 regression coefficients 

48.88196 22.80978 11.49636 -27.79991 0.3424153 

******************************^,1,^f^t1,^,if^,^,^,^,^f^il,^^^^ 



- 103 - 

Iteration 18 : 5 cycles, criterion -1.374340 

tnisclassif ication matrix 
fhat 
5 f 12 

1 23 0 

2 0 22 

row .=true class 

10 Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.16029 22.93629 11.56209 -27.95811 0.3442109 

15 *********************************************** 
Iteration 19 : 5 cycles, criterion -1.368567 

misclassif ication matrix 
fhat 
20 f 12 

1 23 0 

2 0 22 

row =strue class 

25 Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.34987 23.02251 11.60689 -28.06586 0.3454208 

30 ************************** 

Iteration 20 : 5 cycles, criterion -1.364684 



misclassif ication matrix 
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fhat i 
f 1 2 

1 23 0 

2 0 22 
row sstrue class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 
10 49.47861 23.08109 11.63732 -28.13903 0.3462368 

************************** **-k ********** ******** 
Iteration 21 : 5 cycles, criterion -1.362068 

15 misclassif ication matrix 
■ fhat 
f 12 

1 23 0 

2 0 22 

20 row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 
25 49.56588 23.12080 11.65796 -28.18862 0.3467873 

********************************************** ie 

Iteration 22 : 5 cycles, criterion -1.360305 

3 0 misclassif ication matrix 
fhat 
f 12 
1 23 0 




2 0 22 
row =t2rue class 
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Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.62496 23.14769 11.67193 -28.22219 0.3471588 

Iteration 23 : 4 cycles^ criterion -1.359116 

misclassif ication matrix 

• fhat 
f 12 

1 23 0 

2 0 22 

row sstrue class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.6649 23.16588 11.68137 -28.2449 0.3474096 

Iteration 24 : 4 cycles, criterion -1.358314 

misclassification matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row »true class 




5 
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Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.69192 23.17818 11.68776 -28.26025 0.3475789 

Iteration 25 : 4 cycles, criterion -1.357772 

misclassification matrix 
10 fhat 

f 1 2 

1 23 0 

2 0 22 
row =true class 

15 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.71017 23.18649 11.69208 -28.27062 0.3476932 

Iteration 26 .- 4 cycles, criterion -1.357407 

misclassification matrix 
25 fhat 

f 1 2 

1 23 0 

2 0 22 

row =true class 



20 



30 



. Class 1 : Variables left in model 
12 3 6614 7191 

regression coefficients 
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49.72251 23.19211 11.695 -28.27763 0.3477704 
Iteration 27 : 4 cycles, criterion -1.35716 

5 

misclassification matrix 

fhat 
f 12 
1 23 0 
10 2 0 22 

row =true class 



Class 1 : variables left in model 
12 3 6614 7191 
15 regression coefficients 

- 49.73084 23.19590 11.69697 -28.28237 0.3478225 



***********, ^ _ 

.^^ '*********** 

Iteration 28 3 cvcl^a 

J cycxes, criterion -1.356993 

2 0 



misclassification matrix 
fhat 

f 12 
1 23 0 . 
25 2 0 22 

row =true class 



Class 1 : Variables left in model 
12 3 6614 7191 
30 regression coefficients 

49.73646 23.19846 11,6983 -28.28556 0.3478577 
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Iteration 29 : 3 cycles, criterion -1.356881 

misclassification matrix 
fhat 
5 f 12 

1 23 0 

2 0 22 

row =true class 

10 Class 1 : variables left in model 
12 3 6614 7191 
regression coeffidients 

49.74026 23.20019 11.6992 -28.28772 0.3478814 
15 *********************************,*,,,,,^-^^^^^^ 



Iteration 30 : 3 cycles, criterion -i. 



356805 



misclassification matrix 
fhat 
20 f 12 

1 23 0 

2 0 22 

row =true class 



25 Class 1 : variables left in model 
12 3 6614 7191 
regression coefficients 

49.74283 23.20136 11.69981 -28.28918 0.3478975 



30 1 



misclassification table 
pred 
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y 12 3 4 

1 4 0 0 0 

2 0 3 0 0 

3 0 0 4 0 



Identifiers of variables left in ordered categories n,odel 
6611 7188 

10 Ordered categories example 

Luo prostate data 15 samples 50 genes. For k»0 and b=le7 we 
get the following- results 

misclassification table 
15 pred 

y 1 2 3 4 

1 4 0 0 0 

2 0 2 1 0 

3 0 0 4 0 
20 4 0 0 0 4 



Identifiers: of variables left in ordered categories model 

25 For k=0.25 and b=le7 we get the following results 

misclassification table 
pred 

y 12 3 4 

30 14 0 0 0 

2 0 3 0 0 

3 0 0 4 0 

4 0 0 0 4 
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Identifiers of variables left 
1 42 

5 A record of the iterations fdr 
below 



in ordered categories model 



k=0.25 and b=le7 is given 



10 



Iteration l : 19 cycles, criterion -( 



■0.4708706 



misclassification matrix 
fhat 

f 12 

1 23 0 

15 2 0 22 

row =true class 



Class 1 Number of basis functions in model ■ 53 

******-^******..*^^^^^^^^,,^^ ************* 

20 Iteration 2 : 7 cycles, criterion -1.536822 



misclassification matrix 
fhat 

f - 1 2 
25 1 23 0 

2 0 22 
row =true class 



Class 1 Number of basis functions in model - 53 
*****************,,,,,,,^^^^^^^^^^^^^^^^^^^^^^^ 

Iteration 3 : 5 cycles, criterion -2.032919 



misclassification matrix 
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fhat 
f 1 2 

1 23 0 

2 0 22 

5 row =true class 



Class 1 Number of basis functions in model : 42 
Iteration 4 : 5 cycles, criterion -2.132546 

10 

misclassification matrix 
fhat 

f 12 
1 23 0 
15 2 0 22 

row =true class 



Class 1 Number of basis functions in model • is 
20 Iteration 5 : 5 cycles, criterion -1.978462 



misclassification matrix 
fhat 

f 12 
25 1 23 0 

2 p 22 
row =true class 



Class 1 Number of basis functions in model : 13 
30 ***********************, 



Iteration 6 : 5 cycles, criterion -1.668212 



misclassification matrix 




10 
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fhat 
f 12 

1 23 0 

2 0 22 

5 row sstrue class 

Class 1 : Variables left in model 
1 2 3 4 10 41 43 45 
regression coefficients 

34.13253 22.30781 13.04342 -16.23506 0.003213167 0.006582334 
-0.0005943874 -3.557023 

Iteration 7 : 5 cycles, criterion -1.407871 



15 

misclassification matrix 
fhat 

f 12 
1 23 0 
20 2 0 22 

row «true class 

Class 1 : Variables left in model 
1 2 3 4 10 41 43 45 
25 regression coefficients 

36.90726 24.69518 14.6X792 -17.16723 1.1121723-05 5.949931.. 
06 -3.892181e-08 -4.2906 

*****************,,,,^,^^,^^^^^^^^^^^^^^^^^^^^^ 
30 Iteration 8 : 5 cycles, criterion -1.244166 



misclassification matrix 
fhat 
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f 12 

1 23 0 

2 0 22 

row =true class 

5 

Class 1 : Variables left in model 
1 2 3 4 10 45 
regression coefficients 

39.15038 26.51011 15.78594 -17.99800 1.125451e-10 -4.799167 

Iteration 9 : 5 cycles, criterion -1.147754 

misclassification matrix 
15 fhat 

f 12 

1 23 0 

2 0 22 

row =true class 

20 

Class 1 : variables left in model 
1 2 3 4 45 

regression coefficients 

40.72797 27.73318 16.56101 -18.61816 -5.115492 

25 

Iteration 10 : 5 cycles, criterion -1.09211 

misclassification matrix 
30 fhat 

f 1 2 

1 23 0 

2 0 22 



row =true class 
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Class 1 : Variables left in model 
1 2 3 4 45 
5 regression coefficients 

41.74539 28.49967 17.04204 -19.03293 -5.302421 

*********************^^^^^^^^^^^^^^^^^^^^^^^^^^ 
Iteration li : 5 cycles, criterion -1.060238 

misclassification matrix 
fhat 

f 1 2 
1 23 0 
• 2 0 22 

row =true class 



Class 1 ; variables left in model 
1 2 3 4 45 

regression coefficients 

42.36866 28.96076 17.32967 -19.29261 -5.410496 
Iteration 12 : 5 cycles, criterion -1.042037 



misclassification matrix 
fhat 

f 1 2 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
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1 2 3 4 45 

regression coefficients 

42.73S08 29.23176 17.49811 -19.44894 -5.472426 



5 ***************************^^^^ 



***** 



Iteration 13 : 5 cycles, criterion -1.031656 

misclassification matrix 
fhat 
10 f 12 

1 23 0 

2 0 22 
row =true class 

15 Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

42.95536 29.38894 17.59560 -19.54090 -5.507787 

Iteration 14 : 4 cycles, criterion -1.025738 

misclassification matrix 
fhat 

25 f 1 2 

1 23 0 

2 0 22 
row =true class 

30 Class 1 , Variables left in model 
1 2 3 4 45 

regression coefficients 

43.08034 29.47941 1-7 cc-, 

47941 17.65163 -19.59428 -5.527948 
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■ Iteration 15 : . 4 cycles, criterion -1,022366 

5 misclassification matrix 
fhat . 

f 12 

1 23 0 

2 0 22 

10 row =true class 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 
15 43.15213 29.53125 17.68372 -19.62502 -5.539438 

Iteration 16 : 4 cycles, criterion -1.020444 

20 misclassification matrix 
fhat 
f 12 

1 23 0 

2 0 22 

25 row s=trxie class 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 
30 43.19322 29.56089 17.70206 -19.64265 -5.545984 

Iteration 17 : 4 cycles, criterion -1.019349 
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misclassification matrix 
fhat 

f 12 
5 1 23 0 

2 0 22 
row =true class 

Class 1 : variables left in model 
10 1 2 3 4 45 

regression coefficients 

43.21670 29.57780 17.71252 -19.65272 -5.549713 

15 Iteration 18 : 3 cycles, " criterion -l . 018725 

misclassification matrix 
fhat 

f 1 2 
20 1 23 0 

2 0 22 
row =trTae class 



Class 1 : Variables left in model 
25 1 2 3 4 45 

. regression coefficients 

43.23008 29.58745 17.71848 -19.65847 -5.551837 
30 Iteration 19 : 3 .cycles, criterion -1.01837 

misclassification matrix 

fhat 




- 118 - 

f 1 2 

1 23 0 

2 0 22 

row =true class 

5 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

43.23772 29.59295 17.72188 -19.66176 -5.553047 



10 



Iteration 20 : 3 cycles, criterion -1, 



018167 



misclassif ication matrix 
15 fhat 

f 12 

1 23 0 . 

2 0 22 

row =true class 

20 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

43.24208 29.59608 17.72382 -19.66363 -5.553737 



25 



Iteration 21 : 3 cycles, criterion -l.o 



18052 



misclassification matrix 
30 fhat 

f 12 

1 23 0 

2 0 22 



row =true class 



- 119 - 



Class 1 : variables left in model 
1 2 3 4 45 
5 regression coefficients 

43.24456 29.59787 17.72493 -19.66469 -5.55413 



10 



*********** **ii.^r****4,ir 

Iteration 22 : 3 cycles, criterion -l., 



017986 



misclassification matrix 
fhat 

f 12 
1 23 0 
15 2 0 22 

row =true class 



Class 1 : Variables left in model 
1 2 3 4 45 
20 regression coefficients 

43.24598 29.59889 17.72556 -19.6653 -5.554354 



is misclassification table 
pred 
y 12 3 4 

1 4 0 0 0 

2 0 3 0 0 
0 3 0 0 4 0 

4 0 0 0 4 
Identifiers of variables left 
1 42 



in ordered categories model 




10 
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THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS: 

1. A method for identifying a subset of components of a 
system from data generated from the system from a plurality 
of sanples from the system, the subset being capable of 
predicting a feature of - a test sarrple, the method comprising 
the steps of; *- a 

(a) generating a linear combination of component, and 
coit5>onent weights in which values for each component 
are introduced from data generated from a plurality of 
training samples, each training sample having a known 
feature ; 

(b) defining a .odel for the probability distribution of a 
feature wherein the model is conditional on the linear 

-^^ combination; 

<o) constructing a prior distribution for the component 
weights of the linear combination coi^rlsing a 
hyperprior having a high probability density close to 
20 <d. •'^rprlor.ls not a Jeffreys prior, 

(d) co-i'^nxng the prior distribution and the model to 
generate a posterior distribution; 

(e) identifying a subset of components having component 
weights that maximise the posterior distribution 



25 



2. The method as claimed in claim 1, wherein a 
mathematical equation in the form of a likelihood function, 
provxdes the probability distribution based on the data 
obtained from the training samples. 

30 3 The method as claimed in claim 1 or 2, wherein the 

model xs a likelihood function based on a model selected 
from the group consisting of a multinomial or binomial 
logistic regression, generalised linear model, Cox's 
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proportional hazards model, accelerated failure and 
parametric survival model. 

4. The. method as claimed in claim 2, wherein the 
5 likelihood function is based on a multinomial or binomial 
logistic regression. 

5 The method as claimed in claim 4, wherein the binomial 

10 ZlT regression models a feature having a 

10 multinomial or binomial distribution. 



15 



6. The method as claimed in claim 4 
likelihood function based on the legist 
the form: 



ox 5, wherein the 
ic regression is of 



n 

/-I 



r 






I' 


G-l 

n 






1 






\\ 













wherein 



20 



^P. is a linear combination generated t.o. input data fron 
tracing sample i with coni,onent weights fi„ 
4 is the components for the 1" Row of X and is a set of 
component weights for sample class g, and 

K is data from n training samples comprising p components. 

25 7 The method as claimed in claim 4, 5 or wherein the 
IxkeUhood function is based on an ordered categorical 
logistic regression. 
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8. The method as claimed in claim 7, wherein the 
likelihood function based on the categorical ordered 
logistic regression is of the form: 



^ <^^f f N***! -rat 

=nn(f- 



5 Wherein 

is the probability that training sample i bslongs- to a 
Class With identifier less than or equal to * (where the 
total of ordered classes is ic) . 

10 9. The method as claimed in claim 1, wherein the 

likelihood function is based on a generalised linear model. 

10. The method as claimed in claim 9, wherein the ' 

IS TTT^^ preferably models a feature that is 

15 dastrxbuted as a regular exponential family of 
distributions . 



11. The .»thod as Claimed in claim lo. wherein the regular 
exponential family of distributions include normal 
distribution, guassian distribution, poisson distribution, 
ganma distribution and inverse gamma distribution. 

12. The method as claimed m any one of claims 9 or ll 
Wherein the generalised linear model is of the form= ' 



25 



L=logp(y| p, q,)=i;{MzM)+^ ) J 
/=! a^(<p) ^' ' 



Where y = (y.,„, y.,. ^ . ^ ^.^^ ^ 

fixed set Of known weights and # a single scale parameter. 



I 
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13. 



The method as claimed in any one of the preceeding 
claims, wherein the method can be used to predict the time 
to an event for a sample by utilising a likelihood function 
5 based on a hazard model which preferably estimates the 

probability of a time to an event given that the event has 
not taken place at the time of obtaining the data. 

14. The method as claimed in claim 13, wherein the 
10 likelihood function is based on a model selected from a 
group consisting of Cox's proportional hazards model, 
parametric survival model and accelerated failure times 
model . 



20 



15 15. The method as oXateed In any one of claln-s 13 or 14 
wharain a subset of components capable of predicting the ' 
time to an event for a sanple is identified by defining a 
ixkelxhood based on Cox's proportional standards model, a 
parametric survival model or an accelerated survival times 
model. Which conprises measuring the time elapsed for a 
Plurality Of samples from the time the sa,^le is obtained to 
the time of the event. 

16. The method as claimed in any one of claims 13 to 15, 

25 wherexn the likelihood function for predicting the time to 

an event n a rs-p t-u-^ c 



an event is of the form: 

N 



where f and =(^.^...,^^)are the model 

parameters, y is a vector of observed times and c is an 
30 andxcator vector which indicates whether a time is a true 
survival time or a censored survival time. 
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17. The method as claimed in claim 14, wherein the 
ixkelxhood function, when based on Cox's proportional 
hazards model, is of the form: 



10 



exp(Zj^) 



2 ^{Zifi) 



Where the observed ti„es are be ordered in increasing 
-snitude denoted as £=(vHw,-/(„)), ^ 

Zthe ^ ^trix that is the re-arrange-r^nt o£ the rows 
Of X Where the ordering of the rows of Zcorresponds to the 
ordering induced by the ordering of , . and =(A.A,....A) . 

Zj = the row of z, and 91= di- ,-..1 

J V-^-J'J+h—.N}= the risk set 

at the ordered event time t(j) . 



15 18 The method as claimed in claim 14, wherein the 

Ixkelxhood function, when based on the Parametric Survival 
model, is of the form: 



20 



N 



f 



where «=^(^,.g)e^(jr,^) and A denotes the integrated 

^ran^tric hazard function, wherein the conponent weights 

iTwhlTh ^ ^^^^^'^ -atistical ™odeZ 

Which a posterior distribution of the component weights 
xs formulated which co„*>ines the liicelihood function and a 



prior distribution. 
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19- The method as claimed in claim 18. wherein the prior 
distribution is of the form: 

Wherein v Is a p x . vector of hyperpara^eters, and where 
Pi^y) is ;.(0.di.g{.^}) and ie .o„e hyperprlor 

distribution for v^. 

20. The method as claimed in claim 19, wherein the 
hjperprior is an inverse ga™. distribution in which each 

-1/v, has an independent ganma distribution. 

21. The method as cleaned in claim 15, wherein the 
hyperprior is a gamma distribution in which each 

v> ,TfOrT, (depending on the conh<=.-sft- ^ v.=« • , 
distribution. ^ "dependent gamma 

22. The method as claimed in any one o( claim 18 to 22 
wherexn the prior distribution and the liKelihood ,un=ti;n 

are combined to generate r.«=*. • ^. 'unccion 
the r.^ . . generate a posterior distribution, wherein 

the posterior distribution is preferably of the form: 



or 
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wherein L(y\^,^) is the likelihood funct 



ion. 



5 



23. The method as clai,„ed in claim 23, wherein the 
component weights in the posterior distribution are 
estimated in an iterative procedure such that the 
probability density of the posterior distribution Is 



maximised. 



24. The method as claimed in claim 23. wherein the 
Iterative procedure Is an EM algorithm. 

c'lalms^^"'*"^ ^ °' ^'•^ Preceeding 

claxms Wherein systems to which the method of the present 

IS chT T " '"""^^ 

fiZcT/"""" agricultural systems, weather systems, 
financial systems including credit -ri^Xr = 

. _ ^ credxt risk assessment systems, 

insurance systems, marketing systems or conpany record 
systems, electronic systems, physical systems, astrophysics 
systems and mechanical systems. Pnysxcs 



20 



25 



30 



Claims'" " ^'^'"^ °' ^-^^ preceeding 

Claims wherexn the method is particularly suitable for use 
in analysis of biological systems. 

27. The method as claimed in claim 26. wherein the 
biological system is a biotechnology array. 

28. The method as claimed in claim 27, wherein the 
biotechnology array includes oligonucleotide arrays, DKA 
arrays «a microarrays, RKA arrays, r„a microarxays, 
-crochaps, ^ microchips, protein arrays, p„tein 
-crochips, antibody arrays, chemical arrays, carbohydrate 
arrays, proteomics arrays, lipid arrays. 
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29. The method as claimed in claim 26, wherein the 

components may be, for ex-;9inr>-) 

^ ' example, genes or portions thereof, 

DNA sequences, RNA sequences, peptides, proteins 
carbohydrate molecules, lipids or mixtures thereof 
physiological components, anatomical components 
epidemiological components or chemical components. 

30 A method for identifying a subset of conponents of a 
subject Which are capable of classifying the subject into 
one Of a plurality of predefined groups wherein each group 

steps :: ^ ^ ^ ^^^^ — - - 



15 (a) 



(b) 
(c) 

20 



31 



a^slns a plurality of subjects to the test treatment 

and grouping the subjects into response groups based on 

responses to the treatment,- 

measuring components of the subjects- 

Identifying a subset of components that Is capable of 

Classifying the subjects into response groups using a 

statistical analysis method. 



The method as claimed in claim 30, wherein the 
statistical analysis, method includes the method defined in 
2S any one of claims 1 to 29. 

32. An apparatus for identifying a subset of components of 
a subject, the subset being capable of classifying the 
subject into one of a plurality of predefined response 
groups Wherein each response group is formed by exposing a 
Plurality Of subjects to a test treatment and growing L 
subjects into response groups based on the response t! tt 
treatment, the apparatus comprising, 




5 



10 
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(a) 



means for receiving measured components of the 
siibjects; 

(b) means for identifying a subset of components that is 
capable of classifying the subjects into response 
groups using a statistical analysis method. 



33 



The apparatus as claimed in claim 32, wherein the 
statxstical analysis method includes the method defined in 
any one of claims 1 to 3i. 



34 A method for identifying a s^set of conponents of a 
subject which are capable of classifying the subject as 
beang responsive or non-responsive to treatment with a test 
15 compound comprising the steps of : . 

(a) e:cposing a plurality of subjects ' to the co,^ound and 
grouping the subjects into response groups based on 
each subjects response to the compound, 
20 (b) measuring coii«>onents of the subjects- 

(c) identifying a subset of consonants that is capable of 
Classifying the subjects into response groups using a 
statistical analysis method. 

25 35 The apparatus as claimed in claim 34, wherein the 

statistical analysis method Includes the method defined in 
any one of claims i to 29. 



30 



a sub . "-tifying a subset of con^nents of 

a sub3ect. the subset being capable of classifying the 
subject into one of a plurality of predefined response 
groups Wherein each response group is formed by easing a 
Plurality of subjects to a compound and grouping the 
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subjects into response groups based on the response to the 
con5)ound, the apparatus comprising; 

(a) means for receiving measured components of the 
subjects; 

(b) means for identifying a subset of components that is 
capable of classifying the subjects into response 
groups using a statistical analysis method. 



37. 



The apparatus as claimed In claim 36, wherein the 
statistical analysis method includes the method defined in 
any one of claims 1 to 31. 

38. The method as claimed in claim 36 or 37, wherein the 
compound is a pham«ceutical compound or a composition 

IZITT ' ""^^^""-^ a pharmaceutically 

acceptable carrier. 

Tsvst"^ r"""' lO-tifying a subset of components of 
a system from data generated from the system from a 
Plurality of samples from the system, the subset being 
capable of predicting a feature of a test sample, the 

apparatus comprising: 

(a) means for QeneT•;5^^ nrr i • 

generating a linear combination of oon^onents 
and component weights in which values for each 
co^onent are introduced from data generated from a 
Plurality of training samples, each training senile 
having a known feature; 

(b) means for defining a model for the probability 
distribution Of a feature wherein the model is 
conditional on the linear combination; 



(c) 



15 



30 



(d) 
(e) 
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means for constructing a prior distribution for the 

component weights of the linear coordination comprising 

an adjustable hyperprior which allows the prior 

probability mass close to ze-ro • ^ , 

CO zero to be varied wherein the 

hypeiprior is not a Jeffrey's hyperprior; 
means for combining the prior distribution and the 
model to generate a posterior distribution- 
means for identifying a subset of components having 
component weights that maximize the posterior 
distribution. 

40. A computer program arranged, when loaded onto a 
co^uting apparatus, to control the costing apparatus- to 
xn^len^nt a „.thod aa defined in any one o* clai™ l to 2,. 

41. A computer readable medium providing a computer 
program in accordance with the claim 34 or 35. 

42. A method Of testing a sample from a system to Identify 

tes'lT: " '"^ - 

testrng for a suhset of components which is diagnostic of 

^ a method in accordance with any one of the claims l to 
IS a biological system. 



44 An apparatus for testing a sample from a system to 

rLTthf TT '""^ ""^-^^^ in accordance 

wxth the method defined In any one of claims 1 to 31 
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45. A computer program which when run on a computing 
device, is arranged to control the computing device, in a 
method of identifying components from a system which are 
capable of predicting a feature of a test sample from the 
system, and wherein a linear combination of components and 
component weights is generated from data generated from a. 
plurality of training samples, each training sample having a 
known feature, and a posterior distribution is generated by 
combining a prior distribution for the component weights 
comprising an adjustable hyperprior which allows the 
probability mass close to zero to be varied wherein the 
hyperprior is not a Jeffrey's hyperprior, and a model that 
is conditional on the linear combination, to estimate 
component weights which maximise the posterior distribution. 

46. A method for identifying a subset of components of a 
biological system, the stabset being capable of predicting a 
feature of a test sample from the biological system, the 
method comprising the steps of; 

(a) generating a linear combination of components and 
component weights in which values for each component 
are determined from data generated from a plurality of 
training samples, each training sample having a known ' 
f eatxire ; 

(b) defining a model for the probability distribution of a 
featxire wherein the model is conditional on the linear 
combination; 

(c) constructing a prior distribution for the component 
weights of the linear combination comprising an 
adjustable hyperprior which allows the probability mass 
close to zero to be varied; 
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(d) combining the prior distribution and the model to 
generate a posterior distribution; 

(e) identifying a subset of components having component 
weights that maximize the posterior distribution. 

DATED this 26'* day of May 2003 
CSIRO 

By their Patent Attorneys 
GRIFFITH HACK 



